Dimensional Problems of Criteria


Edwin E. Ghiselli 


University of California 


The discussions by Otis (8), Toops (12), 
Bellows (1), and Thorndike (11) constitute 
the fundamental conceptual formulations of 
the criterion problem, yet all have been pub- 
lished within the last decade and a half. 
While others have considered matters con- 
nected with the development of criteria for 
use in validating selective devices, by and 
large their concern has been with technical 
details or some restricted phase. Thus the 
broader aspects of the criterion problem have 
not received the attention they deserve.

In a way it is unfortunate that the term 
criteria has been used to denote measurements 
of job success. This term refers to standards 
for the evaluation of something else, and the 
implication is that the something else is of 
greater importance than the standards them- 
selves. In the present context selective de- 
vices are the something else. It is certainly 
true that far more attention has been devoted 
to the development of predictive devices than 
to the understanding and evaluation of cri- 
teria. 

Since criteria are means for quantitatively 
describing workers’ performance, an examina- 
tion of the dimensional problems of criteria 
would seem both legitimate and necessary. 
While such an examination may raise new 
and as yet unanswerable questions, at least 
those who are concerned with the selection of 
workers will be in a better position to see the 
kinds of problems that confront them. This 
paper will deal with certain matters con- 
nected with the dimensionality of criteria— 
static dimensionality, dynamic dimensionality, 
and individual dimensionality. 


Static Dimensionality 


As it is ordinarily stated, the selection prob- 
lem involves the prediction of a single vari- 
able—the criterion. The presumption is that 
the job proficiency of workers can be de- 
scribed completely by a single dimension. 
But for almost any job there are a number 
of dimensions on which workers’ performance 
can be measured. Thus typists can be 
evaluated not only in terms of speed of typ- 
ing, but also in terms of errors, neatness of 
product, absences, etc. When confronted 
with this situation the common procedure is 
to select one of the criterion variables and 
say that it is the best, the most pertinent, or 
the most representative. 

When there are several criterion variates 
for a given job, sometimes the decision is to 
combine all measures into a single composite. 
This presents a series of technical problems 
and also a series of theoretical ones. Mere 
assignment of equal weight to all components 
is seldom satisfactory, because other grounds, 
perhaps purely intuitive ones, suggest that 
they are not equally important. Almost all 
of those who have attempted rational solu- 
tions to the differential weighting of a set of 
independent variables fall back upon some 
notion of a general factor. The objectives 
may be stated differently. For example, 
Edgerton and Kolbe (4) and Horst (6) 
say that the purpose is to maximize differ- 
ences between individuals in terms of com- 
posite criterion scores and to minimize dif- 
ferences in scores on the different criterion 
variables within the individual. However, 
the end result is the same, and the various 
criterion variates are thereby weighted in
terms of their principal component. 
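
As an illustration of weighting criterion variates by their principal component, the following minimal sketch extracts the first principal component of a correlation matrix by power iteration. The three variates and their intercorrelations are hypothetical values chosen for the example, not data from any of the studies cited, and the function name is likewise an assumption.

```python
import math

def first_principal_component(corr, iterations=200):
    """Unit-length weights of the first principal component of a
    correlation matrix, approximated by power iteration."""
    n = len(corr)
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        v = [sum(corr[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    return w

# Hypothetical intercorrelations among three criterion variates
# (for instance speed, accuracy, and attendance); illustrative only.
corr = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
weights = first_principal_component(corr)
```

In this sketch the variate that shares least with the others receives the smallest weight, which is why the procedure presupposes a general factor running through all the measures.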

This would be a satisfactory solution if it 
could be demonstrated that all of the differ- 
ent measures of performance for any given 
job are determined largely by the same gen- 
eral factors. However, such evidence as there 
is suggests that if there are general factors at 
work, at best they are of minor importance 
(e.g., 2, 9). In other words, it would ap- 
pear that workers’ performance on any given 
job is best described in terms of several di- 
mensions, and one dimension is not sufficient. 

If the proposition is accepted that criteria 
are multidimensional with the dimensions be- 
ing independent, or at least relatively so, then 
the situation is not an easy one. There is 
no way to combine the independent scores of 
an individual into a single value that will
describe him uniquely. Rather it will be 
necessary to locate his position in the multi- 
dimensional criterion space. This can be ac- 
complished in either of two ways: each cri- 
terion dimension can be predicted separately 
and the individual’s position in the space 
estimated, or the space can be divided into 
parts and that portion of the space in which 
the individual is most likely to fall could be 
estimated by the discriminant function.

These solutions require judgments as to 
which portions of the space contain most de- 
sirable individuals. When a single criterion 
variable is being predicted, the problem is 
simple. All that needs be said in making a 
decision with respect to candidates for a job 
is that the higher the score the better. But 
with the multidimensional situation it is nec- 
essary to define which parts of the space con- 
tain satisfactory persons who should be hired, 
and which parts contain those individuals 
who should be rejected. It might be argued 
that those people who are high on every cri- 
terion variate are successful and hence should 
be selected. The criterion space then would 
be divided into two parts, one at the upper 
right-hand corner containing only individuals 
who are high on all variables and therefore to 
be classed as satisfactory, and all of the re- 
maining area of the space containing all of 
the rest of the individuals who would then 
be classified as unsatisfactory. Persons who
fall into this second part would necessarily
be classified as unsatisfactory even if they 
were high on all but one of the dimensions. 

Procedures such as these clearly presume 
that all criterion dimensions are equally important
and that a low score on any one is
tantamount to complete failure. This situa- 
tion may hold in certain circumstances but 
probably not too many. In most cases the 
notion of equality among criterion dimensions 
cannot be supported, and compensation for 
low scores on one dimension must be allowed 
for by high scores on another. 
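
The contrast between the two decision rules can be sketched as follows. The first rule accepts only persons who are high on every criterion variate; the second lets a high score on one dimension compensate for a low score on another. The scores, cutoffs, and importance weights are hypothetical, and the function names are assumptions made for the illustration.

```python
def passes_all_cutoffs(scores, cutoffs):
    """Multiple-cutoff rule: satisfactory only if high on every variate."""
    return all(s >= c for s, c in zip(scores, cutoffs))

def compensatory_score(scores, weights):
    """Compensatory rule: weighted sum, so strengths can offset weaknesses."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical standard scores on three criterion dimensions.
worker = (1.5, 1.2, -0.3)
cutoffs = (0.0, 0.0, 0.0)
weights = (0.5, 0.3, 0.2)   # hypothetical judged importance

rejected_by_cutoffs = not passes_all_cutoffs(worker, cutoffs)
composite = compensatory_score(worker, weights)
```

The same worker is classed as unsatisfactory under the first rule, because of one slightly low score, and yet earns a clearly positive composite under the second.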

Kurtz has provided a rational solution of a 
very different kind (7). He has proposed 
that when criteria are multidimensional they 
need not be combined, nor need the indi- 
viduals’ unique positions in the multidimen- 
sional space be described. Rather predictors 
can be weighted so that the highest possible 
average of the validity coefficients with all of 
the criterion variates is obtained. This solu- 
tion simply ignores the problem of differential 
importance of criteria. Yet a kind of com- 
promise is effected which in many situations 
may be as good as or even better than arbi- 
trarily combining all criterion variates into 
an equally weighted composite and predict- 
ing it. 

Undoubtedly there are still other kinds of 
solutions to the multivariate criterion prob- 
lem, but few have given it systematic con- 
sideration. It is obvious that new ideas are 
necessary and that ingenuity should be the 
order of the day. 


Dynamic Criteria 


The foregoing criteria have been called 
static since in dealing with them the matter 
of change is not considered. For any given 
type of criterion, data are collected for a pe- 
riod of time and then merely summed. Yet 
it is apparent that the performance of work- 
ers does change as they learn and develop on 
the job. The length of time during which 
such change occurs generally seems to be un- 
derestimated. The tendency is to think that 
most improvement occurs within the first few 
weeks or months after the individual is placed 
on the job. However, increases in job proficiency
have been noted over quite long periods
of time. Farmer and Chambers report
significant improvements in the performance 
of bus drivers even after five years on the 
job (5). Haire and the present writer found 
that the performance of investment salesmen 
improved at a constant rate during the first 
six years of employment with no suggestion 
of a leveling off. 

The obvious thing to do is to examine 
the intercorrelations among criterion meas- 
urements taken at different periods of time 
in order to ascertain the kind of pattern that 
holds. For example, analysis of the inter- 
correlations among monthly production rec- 
ords for a period of several years would give 
some indication of the extent to which di- 
mensionality changes with time. 
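
A minimal sketch of such an analysis is given below. It computes the intercorrelations among production records for successive periods and then averages them by the distance between periods, so that near and extreme periods can be compared. The production figures are hypothetical, and the helper names are assumptions.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def period_intercorrelations(records):
    """records[t][i] is the output of worker i in period t."""
    k = len(records)
    return [[pearson(records[a], records[b]) for b in range(k)]
            for a in range(k)]

def mean_r_by_lag(matrix):
    """Average correlation between periods that are `lag` periods apart."""
    k = len(matrix)
    return {lag: sum(matrix[a][a + lag] for a in range(k - lag)) / (k - lag)
            for lag in range(1, k)}

# Hypothetical monthly output of five workers over four months.
records = [
    [10, 12, 15, 11, 18],
    [11, 13, 15, 12, 19],
    [13, 13, 17, 14, 19],
    [15, 13, 17, 17, 20],
]
matrix = period_intercorrelations(records)
by_lag = mean_r_by_lag(matrix)
```

If the averages fall off sharply as the lag increases, the records for early and late periods are not measuring quite the same thing.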

Pertinent facts are few, but generalizing 
from the results of laboratory experiments on 
learning one would expect to find the inter- 
correlations among measures of proficiency 
taken at different periods fairly uniform in 
magnitude, with the relationships among ex- 
treme periods perhaps being somewhat lower 
than the relationships among near periods. 
Haire and the present writer found just this 
state of affairs holding for the intercorrela- 
tions among the monthly sales of investment 
salesmen for a two-year period, and for the 
intercorrelations among the weekly produc- 
tion records of taxicab drivers for an 18-week 
period. This uniformity in magnitude of in- 
tercorrelations is most easily accounted for on 
the basis of a single set of general factors 
equally important throughout time. 

If this be the true state of affairs, it would 
mean that the selection problem is quite 
simple. When those individuals who are su- 
perior in the early phases are differentiated 
accurately, those who will be superior later 
on are thereby located, since they are the 
same individuals. However, even if the in- 
tercorrelations among criteria taken at dif- 
ferent times are exactly the same in magni- 
tude, since they will not be perfect they could 
be accounted for just as well by a variety of 
different factors changing in importance as 
time passes. 

For the taxicab drivers described earlier, 
the scores earned by the men on a series of 
tests administered at the time of hiring were
available. The scores on the various tests
were correlated with production on each of 
the first 18 weeks of employment. If the 
uniform correlations among production rec- 
ords for the various weeks were the result of 
general factors of uniform importance, then 
the validities of the tests should be the same 
throughout the entire period. While for some 
of the tests the magnitude of the validity co- 
efficients did remain practically unchanged 
throughout the period, other tests showed a 
gradual reduction in validity, still others a 
gradual increase, and one even showed regular 
and significant cyclical changes in validity 
from about zero to .40. 
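
The computation behind such a comparison is simple enough to sketch: one validity coefficient is obtained for each week by correlating the test scores obtained at hiring with that week's production. The numbers below are hypothetical and are arranged so that validity declines over time; they are not the taxicab data, and the function names are assumptions.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def validity_by_week(test_scores, weekly_production):
    """One validity coefficient per week for a single test."""
    return [pearson(test_scores, week) for week in weekly_production]

# Hypothetical scores of five men on one test given at hiring.
test_scores = [52, 47, 60, 55, 41]
# Hypothetical production of the same five men in three successive weeks.
weekly_production = [
    [30, 28, 35, 33, 25],
    [31, 30, 33, 34, 29],
    [33, 34, 32, 35, 33],
]
validities = validity_by_week(test_scores, weekly_production)
drift = validities[-1] - validities[0]   # negative when validity declines
```

A test whose coefficients stay level across the weeks fits the notion of general factors of uniform importance; one whose coefficients drift, as here, does not.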

If this be the state of affairs with criteria, 
then the prediction problem is a difficult one 
indeed. The general predictions of job suc- 
cess that one desires to make, predictions of 
performance whether success occurs early or 
late in employment, would necessarily be 
relatively poor. To be of substantial magni- 
tude, predictions would have to be of per- 
formance fairly closely pin-pointed in time. 


Criterion Dimensionality of the Individual 


Some 15 years ago Otis made a most chal- 
lenging statement about the criterion prob- 
lem (8). In effect, he said that several 
workers on the same job might be consid- 
ered equally good and yet the nature of their 
contributions to their organization might be 
quite different. In other words, the idea is 
that workers on the same job might be 
evaluated in terms of different criterion dimensions.
Thus one college professor is considered
good because he is an excellent
teacher, and another because his research is 
very significant. This is not saying that an 
individual is to be considered good if he is 
high on any one criterion variable. Rather 
the notion is that while certain criterion vari- 
ables are appropriate in describing the per- 
formance of some workers, they just are not 
pertinent in describing the performance of 
other workers on exactly the same job. 

It would appear, therefore, that the cri- 
terion dimensionality of the individual should 
be investigated in the same way that Stephen- 
son (10), Cattell (3), and others, investigate 
the personality dimensionality of the individual.
It is quite possible that workers
assigned to the same job perform quite differ- 
ently in a qualitative as well as in a quantita- 
tive sense. For example, one clerk in a de- 
partment store may perceive his job as seller 
of the merchandise assigned to him. An- 
other sales clerk may view his job as builder 
of general good will. The number of dollars 
the store receives as a result of the efforts of 
the two workers might be exactly the same, 
in the one case because the clerk himself sells 
a lot of merchandise, and in the other case 
because the clerk gets the customers to buy 
throughout the entire store. Under these cir- 
cumstances, the factors leading to successful 
performance of the one kind would be quite 
different from those leading to successful per- 
formance of the other kind. It follows, then, 
that different types of tests necessarily would 
have to be used in order to predict the two 
different kinds of performances. 

It might be argued that what is being con- 
sidered here is not one job but two. How- 
ever, in an administrative sense there is just 
one job, and it is only in a psychological 
sense that the jobs are qualitatively different. 
Studies of criterion dimensionality of the in- 
dividual are one way of determining whether 
different positions in the same job in fact are 
psychologically the same or different. 


Some Conclusions Concerning the Dimension- 
ality of Criteria 


The matters discussed here are merely 
translations and formalizations of the kinds 
of problems that are commonly raised in connection
with criteria. They are embarrassing
and confusing questions, but nonetheless le- 
gitimate, and the psychologist should be 
prepared to provide answers to them. The 
questions are termed embarrassing because 
satisfactory answers have not been provided, 
and confusing because their full implications
and the possible scope of answers are not well 
understood. 

The evaluation of selective devices merely 
by simple correlations with single criterion 
variables is insufficient. It is apparent that 
the description of workers’ job performance 
is a complex matter. Satisfactory statements 
concerning validity, therefore, cannot be made 
until rational solutions are developed for the 
various dimensional problems of criteria. 


Received February 14, 1955. 


A Biographical Inventory for Students:
I. Construction and Standardization of the Instrument¹,²


Laurence Siegel 


Miami University 


Psychologists in the clinical, counseling, and 
industrial areas have available an assortment 
of psychometric devices designed to facilitate 
the understanding of present behavior and 
the prediction of subsequent behavior. These 
devices, however, are not by any means the 
sole sources of information. Applied and 
theoretical psychologists alike seek not only 
a description of behavior as it exists at the 
present moment, but also an explanation of 
how it got that way. In the search for pre- 
disposing factors, there can be little quarrel 
with Guthrie’s statement that “The systems 
of habit that make up identifiable personality 
traits are imposed on the individual through 
his learned adaptation to his family, his call- 
ing, his culture, in general the exigencies of 
his environment” (5, p. 61). 

Although the value of biographical infor- 
mation concerning work history, age, marital 
status, etc. is widely accepted (witness the 
administration of application blanks in in- 
dustry and personal history forms in the 
clinic), the systematic evaluation of such in- 
formation is often lacking. The application 
blank and interview as interpreted by the 
psychologically unsophisticated is much too 
subjective to satisfy the industrial or clinical 
psychologist. Quantification of such personal 
history data and subsequent validation studies 
have appeared in the literature from time to 
time, although it was not until World War II 
that the biographical item was more thor- 
oughly investigated (2, 3, 4, 6, 7). 

In general, research with this type of item
has indicated that biographical information
blanks based upon specific job analyses can
be constructed with validity. The applicability
of these instruments, however, has
been limited. Since previous research in this
area has yielded encouraging results, it seemed
appropriate to construct a biographical inventory
that had validity not for just one criterion,
but with potential validity for several
criteria provided that the subjects are comparable
to the standardization group.

1 A large portion of this paper was abstracted from
the writer’s dissertation submitted to the University
of Tennessee in March, 1952, in partial fulfillment of
the requirements for the Ph.D. degree. Additional
research reported in this paper was completed while
on the staff at the State College of Washington.

2 I wish to express my gratitude to Professor Edward
E. Cureton for his constant guidance throughout
the conduct of this research, and to the James
McKeen Cattell Fund for partial subsidization of the
analysis of data.

It becomes clear as soon as one compiles 
biographical items that they are not homo- 
geneous. The basic problem in constructing 
an instrument of this type is to uncover ho- 
mogeneous subsets of items and to standard- 
ize these subsets for a given group of sub- 
jects. More specifically, we were concerned 
with constructing a biographical inventory 
containing homogeneous subsets of items suit- 
able for administration to male high school 
seniors and college freshmen. 

This paper is the first in a series of two. 
It will describe the procedures employed and 
results obtained during the construction and 
standardization of the Biographical Inventory 
for Students (BIS). The second paper in the 
series will describe a group of investigations 
designed to validate the BIS against a variety 
of criteria, including indices of personality 
and ability. 


Method 
Subjects 


Results obtained from administration of the BIS 
to four major groups of Ss will be described in this 
paper. Sample I consisted of 691 male high school 
seniors. The high schools from which the subjects 
were selected were representative of the middle socio- 
economic level in the opinion of the persons obtain- 
ing the samples. No attempt was made to obtain a 
truly random sample. Rather, we obtained “spot 
samples” from the vicinities of Seattle, Washington 
(N = 168); Spokane, Washington (N= 85); Lin- 
coln, Nebraska (N = 333); and Orlando, Florida (N 
= 105). 

These subjects were divided into two groups by as- 
signing every seventh case to Group B. The data 
from Group A (N = 591) were utilized for the purpose
of arriving at stable subscales and for the computation
of subscale norms. Group B (N = 100)
was a hold-out group utilized for the purpose of 
checking the independence of the final subscales. 

As a further check upon the independence of the 
subscales, the BIS was administered to a second 
sample (II) consisting of 66 superior male freshmen 
attending The Ohio State University. 

The homogeneity of the subscales was ascertained 
from the responses of 154 entering freshmen at the 
State College of Washington (sample III). The BIS 
was administered to this sample as part of a bat- 
tery of tests administered routinely during freshmen 
orientation week. 

A fourth sample (IV) consisting of 53 seniors at- 
tending Franklin High School (Portland, Oregon) 
provided information for the estimation of retest re- 
liability. 


Procedure and Results 


The analyses to be described consisted of 
five phases, each of which will be presented 
as a separate unit. 


Phase 1. Item Construction and Derivation 
of Subscales 


Ninety-seven multiple-choice items, all but two of 
which had five alternatives, were prepared in the 
areas of the examinee himself, his parents, friends, 
and siblings. Since each alternative was item-ana- 
lyzed separately, the inventory may be considered 
to consist of 483 items. These biographical items 
were constructed on the basis of suggestive essays 
written by students in a course in elementary psy- 
chology, and a biographical inventory developed by 
Richardson, Bellows, Henry and Co. (New York) in 
addition to the usual procedure of writing items 
based upon the test author’s own experience. 

It seemed desirable to make the BIS as nearly self- 
administering as possible. Therefore the instruction 
sheet contains comprehensive instructions regarding 
the fact that it is possible to mark more than one 
answer for some of the items, etc. Administration of 
the instrument to Sample I (N = 691) demonstrated 
that one hour is sufficient for completion of response. 

Two statistical approaches to the problem of de- 
riving homogeneous and independent subscales were 
considered. The first, factor analysis, would have 
required 116,503 item intercorrelations and an unwieldy
matrix. We concluded that a second procedure,
that of iterative item analysis, would yield
essentially the same result and yet be within the 
scope of manageable research. The method of itera- 
tion was a modification of that described by Wherry 
and Gaylord (9). 
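
The difference in computational burden between the two approaches can be checked with a short calculation. It assumes that factor analysis would call for one correlation for every distinct pair of the 483 separately analyzed alternatives, and that one iteration of item analysis calls for one correlation of each alternative with each of 11 cluster totals. The variable names are introduced here for the illustration.

```python
from math import comb

n_items = 483      # separately analyzed alternatives, as stated in the text
n_clusters = 11    # cluster totals used in the iterative analysis

# One correlation for every distinct pair of items: n(n - 1) / 2.
pairwise_item_correlations = comb(n_items, 2)

# One correlation of every item with every cluster total.
item_cluster_correlations = n_items * n_clusters
```

Under these assumptions the pairwise count comes to 116,403, slightly below the figure quoted in the text, while the item-cluster count reproduces the 5313 biserial correlations reported for each iteration.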

In brief, our procedure involved the following 
steps. 


1. Preliminary allocation of items into “clusters”
by judges. These preliminary clusters served as the
starting point for the analysis. They were formed
with the realization that the composition of each
cluster would change (perhaps radically) after item
analysis.

2. Correlation of each item with the total score 
on each of these preliminary clusters. 

3. Reconstitution of the clusters on the basis of 
the correlations obtained at Step 2. 

4. Recorrelation of each of the items with the total
score on each of the reconstituted clusters.


These steps were to be repeated until cluster stability
was achieved.
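
The steps above can be sketched in a few lines. The sketch is not the procedure of Wherry and Gaylord nor the modification used here: it substitutes the ordinary product-moment (point-biserial) correlation for the biserial coefficient, moves every item to the single cluster with which it correlates most highly, and applies no significance test. The data are a deliberately idealized set of six items and eight respondents, and all names are assumptions.

```python
import math

def pearson(x, y):
    """Product-moment correlation; 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cluster_totals(responses, allocation, n_clusters):
    """Total score of every respondent on every cluster (items scored 1/0)."""
    totals = [[0] * len(responses) for _ in range(n_clusters)]
    for item, cluster in enumerate(allocation):
        for person, row in enumerate(responses):
            totals[cluster][person] += row[item]
    return totals

def reconstitute(responses, allocation, n_clusters):
    """Correlate each item with each cluster total, then move the item
    to the cluster with which it correlates most highly."""
    totals = cluster_totals(responses, allocation, n_clusters)
    new_allocation = []
    for item in range(len(allocation)):
        column = [row[item] for row in responses]
        rs = [pearson(column, totals[c]) for c in range(n_clusters)]
        new_allocation.append(max(range(n_clusters), key=lambda c: rs[c]))
    return new_allocation

def iterate_until_stable(responses, allocation, n_clusters, max_iterations=20):
    """Repeat the reconstitution until the allocation no longer changes."""
    for _ in range(max_iterations):
        new_allocation = reconstitute(responses, allocation, n_clusters)
        if new_allocation == allocation:
            break
        allocation = new_allocation
    return allocation

# Idealized data: items 0-2 follow one pattern, items 3-5 another.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0]
responses = [[a[i], a[i], a[i], b[i], b[i], b[i]] for i in range(8)]

# Hypothetical judges' allocation, with item 2 placed in the wrong cluster.
judged = [0, 0, 1, 1, 1, 1]
final = iterate_until_stable(responses, judged, n_clusters=2)
```

Because each cluster total includes the item being correlated with it, the item-total correlations in this sketch are somewhat inflated; the misplaced item is nevertheless moved to the cluster it belongs with, after which the allocation no longer changes.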

It seemed desirable to have the preliminary clusters 
determined by psychologists in the areas of psychometrics
and adolescent psychology. Accordingly, ten
judges³ allocated each of the items to one of the
following seven categories: heterosexual activities, 
emancipation from the home, social maturity, eco- 
nomic independence, intellectual maturity, maturity 
in use of leisure time, miscellaneous. 

We established an arbitrary criterion of 50 per cent 
agreement among judges for the allocation of an 
item to one of these preliminary clusters. It became 
apparent, however, that the original seven cluster 
titles were in need of revision, as evidenced by the 
fact that the analysis of the allocations by judges 
dictated the placement of 211 of the original items 
into the “miscellaneous” category. In order to re- 
duce the number of iterations ultimately required, 
the cluster titles were revised in the direction of in- 
creased specificity, and the items were reallocated to 
11 new clusters on the basis of the original judgments.
The revised cluster titles which permitted
such a reduction in the number of “miscellaneous”
items were: Act—Action; Soc—Social Activities;
Het—Heterosexual Activities; Rlg—Religious Activities;
LMA—Literature, Music, and Art; Pol—Political
Activities; SEL—Socioeconomic Status; Eco—Economic
Independence; Dep—Dependence upon
the Home; Con—Social Conformity; Ph—Physical
Health; and Miscellaneous.

The item responses of sample IA (N = 591) were 
scored 1 if marked by S and 0 if the item was not 
marked. Every item was then correlated with the 
total score on each of these 11 clusters. (The “Mis- 
cellaneous” category cannot properly be considered 
as constituting a cluster.) New scales were then 
constructed on the basis of these correlations, every 
item was recorrelated with each of the new scales 
and the process was continued until the allocations 
became stable. This required two complete iterations,
each of which necessitated the computation of
5313 biserial correlations (483 items with each of 11
criteria).⁴


3 The participation of the following judges is gratefully
acknowledged: E. E. Cureton, Louise W. Cureton,
D. H. Fryer, R. H. Gaylord, G. F. Kuder, Luella
C. Lowie, Grace Manson, E. O. Milton, H. H.
Remmers, and A. G. Wesman.

4 An IBM procedure for computation of biserial
correlations has been described by Siegel and Cureton (8).

The 11 clusters hypothesized prior to the iterative
analysis were substantiated by the data, although 
many of the original allocations of items were dem- 
onstrated to be incorrect according to the statistical 
criteria. 

The general procedure employed in determining 
correct cluster allocations based upon the first itera- 
tion was to limit consideration to those items with 
p values between .03 and .97. Furthermore, items 
that had approximately equal correlations with more 
than two clusters were not allocated. 

The final allocation of items based upon the second
iteration utilized items with p values between
.02 and .98. Only one allocation for each item was
permitted on the basis of item-cluster correlations,
even though some of the items correlated substan- 
tially with more than one cluster. Items which cor- 
related equally well with more than two clusters 
were discarded from the scoring key. 

The criterion for significance of item-cluster cor- 
relations in both iterations was that the obtained 
correlation exceed by three times or more the stand- 
ard error of a biserial correlation of zero when N = 
591. (The direction of the correlation dictated the 
assignment of positive or negative weights to each 
item.) 
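
The size of correlation this criterion demands can be approximated as follows. The paper does not state which formula was used for the standard error; the sketch assumes the classical approximation for a biserial correlation of zero, namely the square root of pq divided by the product of y and the square root of N, where p is the proportion marking the item, q = 1 - p, and y is the ordinate of the unit normal curve at the point dividing p from q. The function names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def se_biserial_zero(p, n):
    """Assumed classical standard error of a biserial r when the true
    correlation is zero: sqrt(p*q) / (y * sqrt(n))."""
    q = 1.0 - p
    z = NormalDist().inv_cdf(p)   # cut point on the unit normal curve
    y = NormalDist().pdf(z)       # ordinate of the curve at that point
    return sqrt(p * q) / (y * sqrt(n))

def is_significant(r, p, n=591, multiple=3.0):
    """True if |r| is at least `multiple` standard errors from zero."""
    return abs(r) >= multiple * se_biserial_zero(p, n)

threshold_at_half = 3.0 * se_biserial_zero(0.5, 591)
threshold_at_extreme = 3.0 * se_biserial_zero(0.02, 591)
```

Under this assumption an item marked by half of the subjects needs a correlation of roughly .15 to qualify, and an item marked by very few or very many subjects needs a considerably larger one.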

An indication of the composition of the 11 sub- 
scales derived from this analysis may be obtained 
from an examination of a few of the items which 
correlated most highly with each cluster. (The cor- 
relation is indicated in parentheses after the state- 
ment of the item.) 


Act Action 
Father has taught me to fish or enjoy some 
other individual sport (.89) 
I read the sports section of the newspaper every 
day (.76) 

Soc Social Activities 
I know how to mix cocktails (.57) 
I have taken a trip of more than 100 miles with 
friends (.55) 

Het Heterosexual Activities 
I never have dates (— .90) 
During the course of a date I generally kiss the 
girl (.72) 

Rlg Religious Activities
Father is active in a Church group (.81) 
I devote a good deal of time and energy to re- 
ligious activities (.77) 


LMA Literature, Music and Art 
I frequently take part in dramatics (.76) 
I frequently attend concerts (.74) 
Pol Political Activities 
Mother is active in a political group (.64) 
I discuss politics with my father (.59) 
SEL Socioeconomic Status 
We have a vacuum cleaner in my home (.88) 
Eco Economic Independence 
I use my earnings for school expenses (.59) 
When I need money, I get it from a member of 
the family, but I don’t have an allowance 
(— .58) 
Dep Dependence upon the Home 
My whole family goes out together about once 
a week (.68) 
I am permitted to go out in the evening on any 
day of the week (— .59) 
Con Social Conformity 
I object to girls swearing (.86) 
I object to girls wearing highly colored nail- 
polish (.84) 
Ph Lack of Physical Health 
I have no disabilities (— .79) 


The number of items keyed for each of the sub- 
scales and the range of scores actually obtained on 
each subscale (Sample I) are summarized in Table 1. 


Ninety-five of the original 483 items are 
not included in the final scoring keys. These 
items were eliminated for one of the follow- 
ing three reasons: p was not between .02 and 
.98; the item did not correlate significantly 
with any of the subscales; the item corre- 
lated equally with more than two subscales. 

It is apparent from examination of Table 1 
that the subscales differ greatly with respect 
to the number of constituent items and there- 
fore with respect to the range of scores theo- 
retically possible. Soc, for example, has a 
theoretical range of 55 points, whereas Ph 
has a theoretical range of only 17 points. 
Since Ph was so constricted, and the range 
obtained with college samples was even nar- 
rower than that obtained from high school 
samples, this subscale was dropped from the 
BIS and has never been validated. 


Table 1


Summary of Theoretical and Obtained Range of Scores on BIS Subscales


[Table body largely illegible in the source. Column headings that survive: Act, Soc, Het, Rlg, LMA, Pol, SEL, Eco, Dep, Ph. Row labels: Number of items keyed +; Number of items keyed -; Obtained range.]

Obtained range, entries as they survive in source order (column assignment uncertain):
2:30, -6:23, -4:19, -5:20, -5:17, -9:9, -3:15, -4:29, -2:13, -3:9


Table 2


Intercorrelations Between Subscales Obtained from Two Samples


Phase 2. Independence of the Subscales


A major objective of this research was to 
obtain relatively independent subscales of bio- 
graphical items. This independence was veri- 
fied by computing Pearsonian correlations be- 
tween the subscale scores of Ss in Sample IB 
and in Sample II. The matrix of intercorre- 
lations is presented in Table 2. (Results 
from Sample IB appear below the diagonal, 
Sample II above the diagonal.) 

If the two samples upon which the results 
in Table 2 are based were representative of 
the populations from which they were drawn, 
it would appear that the BIS subscales are 
more highly independent for college freshmen 
than for high school seniors. However, the
size of the samples prohibits such a conclu- 
sion. It does appear that the BIS subscales 
are, in fact, relatively independent. Only 14 
of the 45 correlations in the matrix for Sam- 
ple IB and 6 of the correlations in the matrix 
for Sample II are significantly different from 
zero (p < .02). Such overlap in variance is 
not alarming. Only three pairs of subscales 
are significantly correlated in both samples: 
Het with Soc, Het with Eco, and Act with 
Dep. 


Phase 3. Homogeneity of Subscales 


The BIS was administered to Sample III 
and each subscale was divided into approxi- 
mately parallel forms based upon the per- 
centage of respondents marking the item and 
the item validity. Split-test correlations were
then computed for each of the subscales.
These correlations (corrected for length of 
test by application of the Spearman-Brown 
formula) are presented in Table 3. Although 
these correlations might be construed as esti- 
mates of subscale reliability, we feel that such 
an interpretation would be in error. A split- 
test estimate of reliability is appropriate for 
instruments wherein homogeneity can be as- 
sumed (e.g., achievement tests). In the case 
of such instruments it is reasonable to expect 
each item to be measuring whatever the other 
items are measuring and therefore to treat 
split-test correlations as estimates of consist- 
ency of response with a zero time interval be- 
tween administrations. 
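
For reference, the correction for length mentioned above takes the following form: if two comparable parts of a test correlate r, the estimated correlation for a test k times as long is kr / (1 + (k - 1)r), with k = 2 in the split-test case. The function name is an assumption.

```python
def spearman_brown(r_part, length_factor=2.0):
    """Spearman-Brown estimate for a test lengthened by `length_factor`."""
    return length_factor * r_part / (1.0 + (length_factor - 1.0) * r_part)

# A hypothetical correlation of .60 between the two halves of a subscale.
corrected = spearman_brown(0.60)
```

A correlation of .60 between halves thus corresponds to an estimate of .75 for the subscale at full length.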

Table 3


Split-Test Correlations for BIS Subscales


Subscale    r
Dep


Table 4


Retest Reliabilities


Subscale
Act
Soc                          82
Het                          87
Rlg
LMA                          18
Pol                          86
SEL                          86
Eco                          80
Dep                          87
Con
Mean subscale reliability    84


When homogeneity cannot be assumed on
an a priori basis, however, split-test correlations
serve as an empirical demonstration of
the degree of such homogeneity, and the re- 
test method is more appropriate for the esti- 
mation of reliability (conceived of as con- 
sistency of response). Thus it appears from 
the data in Table 3 that the homogeneity of 
each of the subscales, with the exception of 
Eco, is satisfactory. Furthermore, when con- 
sidered together, the data from Tables 2 and 
3 appear to indicate that one is warranted in 
scoring the 10 BIS subscales as independent 
variables. 


Phase 4. Retest Reliability 


The BIS was administered twice to Sample IV for 
the purpose of estimating retest reliability. The 
time interval between administrations was six weeks 
for 24 of these students and two weeks for the 
remaining 29 students. The resultant retest 
correlations are cited in Table 4. 

These retest reliabilities for all subscales 
with the exception of Con are acceptable in 
view of the reliabilities characteristically re- 
ported for personality-type inventories, but 
the data in Table 4 must be interpreted cau- 
tiously. It would be highly desirable to ob- 
tain confirmation of these estimates of reli- 
ability with a larger sample of Ss than that 
employed for this phase of the investigation. 

The low reliability of the Con subscale may 
be a function of the limited number of items 
it contains. It is interesting to note that this 
subscale appeared to be quite reliable for that 
portion of the sample retested after a two- 
week interval and highly unreliable for that 
portion retested after six weeks, whereas the 
other scales were equally reliable for both 
subsamples. 


Phase 5. Norms 


Percentile norms were computed for Sample 
IA and a total of 488 college freshmen (con- 
sisting in part of Sample III). T ratios for 
significance of differences between mean sub- 
scale scores of the regional subgroups consti- 
tuting Sample IA indicate that proper inter- 
pretation of BIS scores is dependent upon the 
use of appropriate regional norms, particu- 
larly for Act, Rig, and SEL. 

The differences between scores earned on 
the BIS subscales by high school and college 
samples are even more startling than the re- 
gional differences among high school students. 
The college sample scored significantly higher 
(p < .02) on Soc, Het, Pol, and SEL, and 
significantly lower than the high school sam- 
ple on Rig, Eco, and Con. The direction of 
these differences appears reasonable in view 
of our knowledge of the selective factors op- 
erating as determinants for continuation of 
education beyond the high school level (and 
might be interpreted as supportive of sub- 
scale validity). The implication of these dif- 
ferences for utilization of separate norms for 
high school and college groups is rather ob- 
vious. 


Summary 


The construction and standardization of a 
self-administering biographical inventory of 
possible interest to psychologists in the coun- 
seling and clinical areas has been described. 
A modification of iterative procedure was 
found to be satisfactory for the development 
of 10 subscales which are relatively homoge- 
neous, independent, and reliable. 

One of the major advantages of the BIS as 
an adjunct to counseling is that it consists of 
factual questions that are assumed to be less 
productive of distorted response or unpleas- 
ant emotional associations than items usually 
constituting personality inventories. In addi- 
tion to providing meaningful subscale scores, 
the BIS provides the counselor with item re- 
sponses comparable to those obtained on many 
of the “Personal History Forms” currently 
administered as a matter of routine. These 
advantages are not, however, obtained at the 
sacrifice of validity. (The second paper in 
this series will summarize a sequence of in- 
vestigations designed to validate each of the 
subscales.) 

The major deficiencies of the BIS at the 
present time are: (a) norms are based upon 
rather small samples (591 high school seniors 
and 488 college freshmen); (b) the present 
form of the inventory is suitable for adminis- 
tration to males only. The first of these de- 
ficiencies will be rectified as more data become 
available. A BIS suitable for administration 
to females will be developed only after the 
present form for males has been demonstrated 
to be a worthwhile addition to the arma- 
mentarium of psychometric devices already 
available to the psychologists. 


Received October 1, 1954. 
Vocational Interests of Male and Female Social Workers 1 


Robert L. McCornack 


University of Minnesota 


This study was undertaken to assist schools 
of social work in the task of selecting and re- 
cruiting students. In addition, the purpose 
was to make the two forms of the Strong Vo- 
cational Interest Blank (SVIB) more useful 
in the guidance of students. It was therefore 
proposed to develop a social worker key on 
both the men’s and women’s forms of the 
SVIB. The women’s forms presently contain 
such a key developed on subjects contacted 
between 1934 and 1941. Because of the 
rapidly changing character and professionalization 
of social work, it is possible that the 
key was in need of revision. The important 
practical question of whether a separate key 
for each sex is necessary was also studied. 


Sample 


By means of a table of random numbers, 
a sample of 700 women and 700 men was 
drawn from the official Membership Direc- 
tory for 1952 of the American Association of 
Social Workers. Each subject was asked by 
mail to answer the test form appropriate for 
his sex. In addition, 250 of each sex group 
were asked to answer both the men’s and 
women’s form of the SVIB. Four follow-up 
letters were used in obtaining 1,183 returns. 
This represents nearly 87% returns when the 
34 undeliverable mailings were eliminated. 
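The return percentage quoted above follows from the mailing figures given in this paragraph. A minimal sketch of the arithmetic, with a hypothetical helper name, is:

```python
def return_rate(returns: int, mailed: int, undeliverable: int) -> float:
    """Proportion of deliverable mailings that were returned."""
    return returns / (mailed - undeliverable)


# Figures from the text: 700 women and 700 men mailed,
# 34 undeliverable mailings, 1,183 returns.
rate = return_rate(returns=1183, mailed=700 + 700, undeliverable=34)
print(f"{rate:.1%}")
```

The result is a little under 87%, in line with the "nearly 87%" reported.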

Before inclusion, a respondent must have 
met certain criteria which were selected before 
any returns were received. These were 
full membership in the professional organization, 
three years of full-time experience, less 
than 60 years of age, currently active full 
time in the profession, residence in the continental 
United States, and completion of the 
test. These criteria resulted in the rejection 
of 223 respondents, nearly 19% of those responding. 
The median age of the men was 
39, and that of the women was 44. The mean amount 
of education for both sexes was about 18 
years. The two years of graduate work necessary 
for the master’s degree was possessed by 
83% of the men and 71% of the women. 
The mean number of years of experience was 
11 for the men and 15 for the women. In 
services performed, the men were almost 
evenly divided between administration and 
direct services to individuals. The women 
were predominantly in the areas of direct 
service or supervision of those giving direct 
service. Subjects were from nearly every 
state and represented every specialized field 
of social work. 


1 The writer is presently at Wayne University. 
This research was made possible by a grant from the 
Louis W. and Maud Hill Family Foundation, and 
constitutes a portion of the writer’s Ph.D. dissertation 
done under the direction of Professor Donald 
G. Paterson. The writer wishes to express his appreciation 
for the extensive assistance given him by 
Professor John C. Kidneigh, Director of the School 
of Social Work of the University of Minnesota; and 
to Dr. F. L. Erlandson who kindly provided the 
male cross-validation group. 


The Men’s Key 


The 1952 Social Worker key for the SVIB 
for Men was constructed by contrasting the 
responses of the 496 male social workers with 
those of Strong’s 1938 professional Men-in- 
General group. Strong’s item-weighting tech- 
nique was used. In addition to the key de- 
veloped on the entire group, two keys were 
developed on random halves of 248. A dou- 
ble cross-validation (2, 4) procedure was uti- 
lized by cross-validating each half-sample key 
on the half sample not used in constructing 
the key. The half-sample keys were highly 
related to the total-sample key as shown by 
the Pearson r’s of .996 and .991. 

Since the purpose of these keys was to 
separate the social workers from workers in 
other occupations on the basis of interests, 
the effectiveness of a key was measured by 
the amount of differentiation of social workers 
from Men-in-General. Key A, the key 
developed on half-sample A, was used to score 
the tests of the subjects in half-sample B. 
Key B, the key developed on half-sample B, 
was used to score the tests of the subjects in 
half-sample A. Table 1 contains the data 
necessary for judging the effectiveness of the 
three keys. 


Table 1 

Effectiveness of the Three Social Worker 
Keys for Men 

                                           Mean Difference 
            Social    Social    Men-in-    in Social         Percentage 
            Worker    Worker    General    Worker            of 
Key         Mean      SD        Mean       SD Units          Overlapping 
Key A       150.5     63.6      -56.6      3.26              15% 
Key B       155.6     71.3      -52.3      2.92              19% 
Key Total   162.6     68.5      -50.4      3.11              16% 
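The "mean difference in social worker SD units" reported in Table 1 can be recomputed from the other columns. The sketch below assumes the measure is simply the difference between the two group means divided by the social worker standard deviation; the function and variable names are illustrative.

```python
# (social worker mean, social worker SD, Men-in-General mean), as read from Table 1.
KEYS = {
    "Key A": (150.5, 63.6, -56.6),
    "Key B": (155.6, 71.3, -52.3),
    "Key Total": (162.6, 68.5, -50.4),
}


def sd_unit_difference(sw_mean: float, sw_sd: float, general_mean: float) -> float:
    """Separation of the two means in units of the social worker SD."""
    return (sw_mean - general_mean) / sw_sd


separations = {key: round(sd_unit_difference(*row), 2) for key, row in KEYS.items()}
for key, value in separations.items():
    print(key, value)
```

Under this reading the three computed values agree with the tabled figures of 3.26, 2.92, and 3.11.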

Only one male social worker scored below 
any of the Men-in-General means. In terms 
of the standard deviation of the social worker 
group, the means were widely separated. In 
order to use the measure of overlapping commonly 
used by Strong (5, p. 110), some estimate 
of the Men-in-General standard 
deviation was needed. 
Strong has published such standard deviations 
for nearly all of his keys (6). Using the average 
of these as an approximation to the unavailable 
Men-in-General standard deviation 
on the social worker keys, percentages of overlapping 
were computed. The percentages of 
overlapping in Table 1 are the percentages of 
scores of one distribution that can be matched 
with scores of the other distribution, as described 
by Tilton (9). None of Strong’s published 
overlap figures for other occupational 
keys is lower than these. Kriedt’s (3) psychologist 
key probably has a lower overlap 
figure, however. 
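An overlap measure of this kind is often computed, for two normal distributions treated as having a common spread, as twice the normal tail area beyond half the standardized mean difference. The sketch below follows that reading. Taking the average of the two standard deviations as the common spread is an assumption, since the text does not say how the two were combined, and the inputs shown are purely illustrative rather than values from this study.

```python
from statistics import NormalDist


def overlap_percentage(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Percentage of scores in one distribution that can be matched in the other,
    assuming normality and a common SD taken as the average of the two SDs."""
    common_sd = (sd_a + sd_b) / 2.0          # assumption: simple average of SDs
    d = abs(mean_a - mean_b) / common_sd     # standardized mean difference
    return 100.0 * 2.0 * NormalDist().cdf(-d / 2.0)


# Illustrative unit-SD distributions two SDs apart.
print(round(overlap_percentage(2.0, 1.0, 0.0, 1.0), 1))
```

Identical distributions give 100% overlap, and the figure falls toward zero as the means separate.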

The double cross-validation procedure yields 
cross-validated estimates of the effectiveness 
of only the half-sample keys. Of course, it 
is the 1952 Social Worker key developed on 
the total sample that is of interest. Erland- 
son (1), in an unpublished study, gave the 
SVIB to 75 male social workers who held a 
master’s degree in social work, largely from 
one university. These data were obtained 
through the mail with 83% returns. They 
provide a completely independent sample of 
fully qualified male social workers. On the 
total-sample key, the mean and standard de- 
viation for this group were 158.4 and 68.9. 
The mean and variance of this group did not 
differ significantly from the mean and variance 
of the original group of 496 men, t = 
0.492 and F = 1.013. The Men-in-General 
mean differed from this cross-validation group 
mean by 3.03 standard deviation units. Uti- 
lizing the same approximation, the percent- 
age of overlapping between the Men-in-Gen- 
eral and cross-validation group distributions 
was 17%. None of the 75 men earned a 
score below the Men-in-General mean. These 
results seem to confirm completely those ob- 
tained from the analysis of the primary data. 
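The cross-validation figures can be checked approximately from the summary statistics given in the text and in Table 1. The sketch below assumes a pooled-variance t test and a simple variance ratio for F; the article does not state which formulas were used, so small discrepancies from the reported values are to be expected from rounding of the published means and standard deviations.

```python
from math import sqrt

# Summary statistics on the total-sample key, as given in the text and Table 1.
orig_mean, orig_sd, orig_n = 162.6, 68.5, 496   # original social worker group
cv_mean, cv_sd, cv_n = 158.4, 68.9, 75          # Erlandson cross-validation group
men_in_general_mean = -50.4                      # Men-in-General mean, Key Total

# Separation of the cross-validation mean from Men-in-General, in SD units.
separation = (cv_mean - men_in_general_mean) / cv_sd

# Variance ratio (assumed form of the reported F).
variance_ratio = (cv_sd / orig_sd) ** 2

# Pooled-variance t for the difference between the two social worker means (assumed form).
pooled_var = ((orig_n - 1) * orig_sd**2 + (cv_n - 1) * cv_sd**2) / (orig_n + cv_n - 2)
t_value = (orig_mean - cv_mean) / sqrt(pooled_var * (1 / orig_n + 1 / cv_n))

print(round(separation, 2), round(variance_ratio, 3), round(t_value, 3))
```

The separation reproduces the reported 3.03 SD units, and the t and F values come out close to the reported 0.492 and 1.013.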


The Women’s Key 


The 1952 Social Worker key for the SVIB 
for Women was constructed by contrasting 
the responses of the 464 female social work- 
ers with those of Strong’s Women-in-General 
group. The double cross-validation procedure 
was used with random halves of 232 in de- 
veloping half-sample keys C and D. Both 
half-sample keys were correlated with the 
total-sample key to the extent of .98. Table 
2 contains the data necessary for judging the 
effectiveness of the half-sample and total-sample 
keys. 

Six female social workers scored below the 
Women-in-General mean on the total-sample 
key. Using the social worker standard deviations 
as units, the means of the social 
worker and Women-in-General groups were 
widely separated, as shown in Table 2. 
Again, the standard deviation of the Women-in-General 
group was estimated by the average 
of the published standard deviations on 
other keys (7). Parenthetically, it might be 
noted that this approximation coincided almost 
exactly with the standard deviation of 
the Women-in-General group given by Strong 
for this Social Worker key. Utilizing this approximation, 
the percentage overlap figures 
in Table 2 were computed. Four of the keys 
presently used on the SVIB for Women have 
lower overlap figures. 


Table 2 

Effectiveness of the Three Social Worker 
Keys for Women 

                                            Mean Difference 
            Social    Social    Women-in-   in Social         Percentage 
            Worker    Worker    General     Worker            of 
Key         Mean      SD        Mean        SD Units          Overlapping 
Key C       79.4      32.7      0.9         2.40              29% 
Key D       91.0      34.3      14.7        2.22              32% 
Key Total   87.4      33.7      6.9         2.39              29% 

Strong’s Social Worker key presently used 
on the SVIB for Women was evaluated in 
three ways. First, the tests of the 464 fully 
qualified social workers sampled in this study 
were scored with Strong’s key. The mean 
score of these female social workers of 72.0 
was significantly lower than the mean of 85.8 
for Strong’s original criterion group, t = 2.69. 
The variances of the two groups did not differ 
significantly, F = 1.054. Second, by utiliz- 
ing the mean and standard deviation of the 
Women-in-General group published by Strong, 
the percentage of overlapping between the 
present sample and the Women-in-General 
distributions was found to be 50%. This 
does not compare favorably with the 40% 
overlap computed on the original criterion 
group (7, p. 2). Furthermore, neither of 
these overlap figures is as low as the overlap 
figures for the 1952 Social Worker key for 
women developed here. Third, the correla- 
tion between Strong’s key and the key de- 
veloped in this study was .73. In 28 revisions 
that Strong has made in keys for the men’s 
form of the SVIB, the average correlation be- 
tween the original and revised keys was .93 
(8, p. 48). The one lowest was .81. It 
would seem as though a revision of the wom- 
en’s Social Worker key was more needed than 
many of the revisions Strong himself has per- 
formed. 

Sex Differences 


There were 140 men and 139 women who 
had taken both forms of the SVIB. Elimi- 
nating the undeliverable mailings, these rep- 
resent only 57% returns out of the 250 per- 
sons in each sex group to whom both forms 
were mailed. Despite the low percentage of 
returns, it was found that these groups were 
not biased with respect to scores made on the 
two social worker keys developed in this 
study. 

In the male group the men’s and women’s 
Social Worker keys were correlated .75. In 
the female group, the correlation was .78. 
These correlations are not sensitive to abso- 
lute differences, however. The mean score of 
the men on the women’s key was 83.0. This 
mean does not differ significantly from the 
mean score of the female criterion group reported 
in Table 2, t = 1.40. The mean score 
of the women on the men’s key was 121.7. 
This mean is significantly lower than the 
mean of the male criterion group found in 
Table 1, t = 6.81. 


Conclusions 


Male social workers have homogeneous in- 
terests permitting the construction of a key 
which very successfully differentiates them 
from a Men-in-General group. Sex differ- 
ences exist which make it unwise to use this 
key on female subjects. 

Female social workers have homogeneous 
interests, making it possible to construct a 
key which differentiates them from a Women-in-General 
group. This key was less successful 
than the men’s key, although it differentiates 
more effectively than Strong’s Social 
Worker key for the women’s form of the 
SVIB presently used. No sex differences 
exist on this key. 


Received February 3, 1955. 
A Further Study of Experience-Centered and Requirements- 
Centered Tests of Job Knowledge 1 


Harry M. Mason 


University of Illinois 2 


In a recent issue of this Journal (3) the 
writer reported results of a study of job 
knowledge of Air Force airplane and engine 
(A & E) mechanics. The study compared 
experience-centered tests, derived from at- 
tempts to measure what workers may learn 
from jobs, with requirements-centered tests, 
designed to measure the extent to which 
workers or applicants meet formally pre- 
scribed job requirements. In the study to 
be reported here, revised tests of the two 
types were administered to 553 Air Force 
A & E mechanics, chosen to represent a 
different military mission, different aircraft 
types, and a different geographical region 
from the corresponding categories involved 
in the first study (Tactical rather than Air 
Training Command; C-119, C-124, and B-26 
rather than B-29 and B-50 aircraft; and 
Southeast States rather than Rocky Mountain 
States). Results of the later investigation, 
reported here, corroborate in general the 
results of the earlier study. 


Tests 


More complete descriptions of the tests 
which were used in both studies may be 
found in the earlier report, but the following 
summaries will be helpful here. 


Experience-centered tests. The Training Research 
Laboratory (TRL) Aviation Information Test is a 
revision of a test used earlier. Its 30 five-choice 
items are aimed at knowledge of airplane charac- 
teristics more available to persons doing aircraft 
work than to the general public, though not re- 
stricted or controlled for security reasons. 

The TRL Maintenance Facts Test is a revision of 
the 75-item Maintenance Techniques Test used in 
the earlier study. The new test contains 45 four-choice 
items, 22 of which are carried over unchanged 
from the original form. The test was derived from 
interviews in which 50 experienced mechanics described 
job operations which appeared to them to 
differentiate good from poor mechanics. 

Requirements-centered tests. The TRL Aviation 
Mechanics Technical Knowledge Test, Revised, contains 
51 items previously found to discriminate effectively 
between mechanics in early and late phases 
of A & E mechanics schools. The items used were 
obtained from mechanics school instructors. The 
test was revised by dropping from the original test 
form nine items which appeared to duplicate subject 
matter covered in other items. 

To measure knowledge of applications of physical 
and mechanical principles, the 60-item Test of Mechanical 
Comprehension, Form BB, by Bennett (1) 
and a 38-item TRL Basic Principles Test were used. 
The Basic Principles Test was constructed for the 
present study. These tests served as replacements 
for two mechanical tests from the Airman Classification 
Battery, which were employed in the earlier 
study, and which appeared to be too easy for mechanics 
having job experience. 

Mechanics required from 1% to 34 hours to answer 
the five tests, which were untimed. All tests 
were scored for number of right responses. The 
Test of Mechanical Comprehension was also scored 
according to the publisher’s formula, rights minus 
one-half the wrongs, to provide a comparison between 
mechanics’ scores and published norms. 

1 This research was supported in part by the United 
States Air Force under Contract No. AF 33(038)- 
25726, monitored by the Air Force Personnel and 
Training Research Center. Permission is granted for 
reproduction, translation, publication, use, and disposal 
in whole and in part by or for the United 
States Government. 

2 Now at Kansas Wesleyan University. 


Criteria 


Amount of job experience was the principal criterion 
used in the earlier study. The present study 
employs AF-designated skill levels as criteria. Level 
3 mechanics are designated as apprentices, Level 5 
mechanics are called senior workers, and Level 7 mechanics 
are designated as mechanic-supervisors or 
technicians. The two extreme skill levels, 3 and 7, 
are substantially nonoverlapping in amount of experience; 
only 7 of 160 airmen in the Level 3 group 
had more than one and one-half years of job experience 
when tested, while all of the 116 Level 7 men 
except three had more than this amount. The intermediate 
skill group, Level 5, contains many members 
overlapping the Level 3 group and several overlapping 
the Level 7 group in amount of job experience. 
Level 7 includes the majority of crew chiefs, these 
being practically absent at Level 3. The skill level 
criterion thus imposes additional requirements beyond 
those implied in amount of job experience. 

Tables showing additional sample characteristics, 
raw test score distribution statistics, intertest correlations 
and other score breakdowns have been deposited 
with the American Documentation Institute. 
Order Document No. 4727 from the ADI Auxiliary 
Publications Project, Photoduplication Service, Library 
of Congress, Washington 25, D. C., remitting 
in advance $1.25 for microfilm or $1.25 for photocopies. 
Make checks payable to Chief, Photoduplication 
Service, Library of Congress. 


Table 1 

Adjusted Mean Standard Scores of the Five Tests for Each Skill Level 

Experience-centered tests: Maintenance Facts*, Aviation Information** 
Requirements-centered tests: Technical Knowledge, Basic Principles, Mechanical Comprehension 

Skill Level                                Maintenance Facts* 
3 (Apprentice)                             3.37 
5 (Senior mechanic)                        5.30 
7 (Supervisor-technician)                  6.55 
Mean difference, Level 7 minus Level 3     3.18 

[Entries for the other four tests, and the rows for within-groups variance of standard scores (before adjustment) and F***, are not legible in this copy.] 

* Maintenance Facts has significantly higher relationship to skill level than any other test except Aviation Information. 

** Aviation Information has a significantly higher relationship to skill level than Mechanical Comprehension. Its superiority 
over other requirements-centered tests is not shown to be significant. See text for details of the test of significance of difference 
in degree of relationship to the criterion. 

*** All F ratios in this table are significant beyond the 1% level, indicating that all tests show significant relationship to skill 
level. Heterogeneity of variance between skill levels is not significant at the 5% level for any test. 


Results 


Table 1 shows criterion group standard 
scores on the five tests, the tests being arranged 
in order of amount of mean difference 
shown between extreme criterion levels. Since 
all tests show significant positive relationships 
to the criterion, it was necessary to determine 
whether or not the relationship of each test 
to the criterion was significantly greater than 
the relationship between the criterion and 
each of the other tests. This was accomplished 
by taking the tests one pair at a time, 
determining whether each mechanic’s standard 
score on Test A of the pair was lower 
than, equal to, or greater than his standard 
score on Test B, and then comparing the distributions 
of these “difference scores” at the 
three criterion levels. A significant chi square 
would indicate that Test A was easier at low 
criterion levels and harder at high criterion 
levels than Test B, or vice versa.3 As indicated 
in Table 1, the Maintenance Facts test 
displayed a significantly higher relationship to 
skill level than any other test, and the other 
experience-centered test, Aviation Information, 
was significantly superior to Mechanical 
Comprehension. Though Aviation Information 
shows an observed superiority over other 
requirements-centered tests, its superiority is 
not significant statistically. Results of the 
earlier study are thus corroborated: experience-centered 
tests again showed the higher 
relationships to the criterion. 
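The pairwise comparison described above can be sketched as a contingency-table analysis. The code below is a minimal illustration, not the author's computation: it assumes an ordinary Pearson chi square on the table of difference-score category (lower, equal, greater) by skill level, and the function name and the tiny data set are hypothetical.

```python
from collections import Counter


def difference_score_chi_square(scores_a, scores_b, levels):
    """Pearson chi square for the sign of (Test A - Test B) against criterion level.

    Returns the statistic and its degrees of freedom. Only categories that
    actually occur are tabulated, so every expected frequency is positive.
    """
    def category(a, b):
        return "lower" if a < b else "greater" if a > b else "equal"

    counts = Counter((category(a, b), lvl) for a, b, lvl in zip(scores_a, scores_b, levels))
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    n = sum(counts.values())
    row_total = {r: sum(counts[(r, c)] for c in cols) for r in rows}
    col_total = {c: sum(counts[(r, c)] for r in rows) for c in cols}

    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = row_total[r] * col_total[c] / n
            chi2 += (counts[(r, c)] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return chi2, dof


# Hypothetical standard scores for four mechanics at skill levels 3 and 7.
chi2, dof = difference_score_chi_square(
    scores_a=[1, 1, 5, 5], scores_b=[2, 2, 4, 4], levels=[3, 3, 7, 7]
)
print(chi2, dof)
```

In the toy data Test A falls below Test B at Level 3 and above it at Level 7, which is the pattern a significant chi square would signal; the statistic would then be referred to a chi-square table with the returned degrees of freedom.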


Discussion 


Results of the present study demonstrate 
that the superior criterion relationship of experience-centered 
tests over requirements-centered 
tests is not specific to one study, but 
persists when the subjects tested 
perform similar duties on aircraft of 
different types, at bases situated in a different 
geographical region. Further research 
might well consider whether subject matter 
relating to duties, found in tests constructed 
for this job at other laboratories, is superior 
to subject matter related more closely to principles 
of mechanical functioning found in the 
same tests. If items aimed at description of 
duties are found to be generally superior to 
those aimed at knowledge of mechanical principles, 
this would be evidence that “duties” 
and “job theory” are differentially valid as 
subject-matter areas. However, a failure of 
this distinction to appear in tests constructed 
at other laboratories would suggest that the 
critical factor in superiority of the experience-centered 
tests used in the present series of 
studies lies in the method through which expert 
mechanics’ insights were brought to bear 
upon development of test content. In the 
most effective experience-centered test used, 
Maintenance Facts, expert mechanics provided 
general problems, rather than test items. 
The item writer, alerted to a general problem, 
reduced it to a specific case, often employing 
the contrasts between work methods 
of good and poor mechanics, emphasized by 
the experts, as bases for plausible item misleads. 
In the Aviation Information Test, an 
attempt was made to emphasize relatively 
easy distinctions between aircraft, rather than 
to test maximum performance. The idea that 
distinctions between good and poor mechanics 
lay generally in interests and values rather 
than in mental abilities, which guided production 
of the Aviation Information Test, had 
appeared in the interviews referred to earlier, 
and was reinforced by results of a study by 
McQuitty, Wrigley, and Gaier (2). 

3 The writer is indebted to W. E. Kappauf and to 
L. L. McQuitty for suggesting this general approach 
to determining significance of differential relationship 
to the criterion. 


Received April 22, 1955. 


References 


1. Bennett, G. K. Mechanical Comprehension Test, 
Form BB Manual (Rev.). New York: Psy- 
chological Corp., 1951. 

2. McQuitty, L. L., Wrigley, C., & Gaier, E. L. An 
approach to isolating dimensions of job suc- 
cess. J. appl. Psychol., 1954, 38, 227-232. 

3. Mason, H. M. A comparative evaluation of two 
approaches to job-knowledge test construction. 
J. appl. Psychol., 1954, 38, 384-389. 





The Journal of Applied Psychology 
Vol. 40, No. 1, 1956 


Comparison of Three Morale Measures: A Survey, Pooled 
Group Judgments, and Self Evaluations 1, 2 


Wilse B. Webb 


U.S. Naval School of Aviation Medicine 


and E. P. Hollander 


Carnegie Institute of Technology 


That there are many ways of defining mo- 
rale, and very little agreement about the best 
definition, is probably all too obvious. In his 
recent work, Viteles (8) has been led to com- 
ment that generalizations regarding morale 
have suffered by “the almost consistent fail- 
ure of surveys (and also of experimental 
studies) to deal with the problem within an 
appropriate context of theory . . . the most 
usual tendency has been to define the dimen- 
sions of morale in terms of what is revealed 
by the investigation” (8, p. 282). 

As one turns attention to the actual meas- 
urement of morale, an understandably dis- 
jointed state of affairs presents itself. It may 
be suggested that three interrelated condi- 
tions underlie this situation: first, varying in- 
vestigators have been prone to select a single 
dimension as a total definition of morale; 
second, few cross comparisons have been 
made of the measures reputedly tapping “mo- 
rale”; third, too frequently the measurement 
taken has been accepted as valid without fur- 
ther reference to behavior. 

Amid this turmoil, the attitude or questionnaire 
survey has continued to thrive as a 
singularly prominent technique of morale assessment. 
The healthy respect it has achieved 
has largely stemmed from its pragmatic utility 
in pin-pointing causes of employee satisfaction 
or dissatisfaction. But this is not always 
enough, for the “morale index” resulting 
from such surveys has been noted, on 
occasion, to produce the apparent anomaly of 
a negative relationship between the presumably 
favorable responses and productivity 
records. A case in point is reported by Katz 
from the data yielded in a large-scale industrial 
study (6, p. 160). 


1 The research reported here was completed at the 
U.S. Naval School of Aviation Medicine under ONR 
Contract NR154-098 between that institution and 
Tulane University. Opinions and conclusions are 
those of the authors. They are not to be construed 
as necessarily reflecting the view or the endorsement 
of the Navy Department. 

2 The authors wish to express their thanks to David 
J. Mitchell for his assistance in the analysis of the 
data. 

In contrast to the rather extensive use of 
various forms of the survey, only infrequent 
utilization has been made of self or co-worker 
evaluation of individual morale. With the 
heightened popularity of such devices, their 
relatively extended history in military psy- 
chology (5) and the particularly noteworthy 
power of the peer rating in predicting behav- 
ioral criteria (3), the virtually exclusive reli- 
ance on the more traditional techniques is 
somewhat surprising. 


Problem 


Taking these considerations into account, it 
would appear desirable to initiate exploratory 
studies which would permit: (a) the determi- 
nation of interrelationships between several, 
simultaneously produced indices of morale, 
oriented about a single, common definition; 
and (b) the evaluation of these measures as 
regards their relative validity against some 
suitable behavioral criterion of morale. This 
paper reports such a study. 


Procedure 


Within the context of naval air training, morale 
was defined quite simply as “an interest in and en- 
thusiasm for the naval air program.” This defi- 
nition was borrowed from one proposed by Smith 
and Weston (7, p. 1) in their Air Force study. It 
implies a felt need to succeed in, to be part of, and 
to contribute to naval aviation. Most significant, 
perhaps, it implies a desire to complete the training 
curriculum, so far as personal desires operate to that 
end. This definition was deemed particularly appropriate 
since (a) measurement could be performed 
by the several methods suggested, and (b) it was 
practical in that measured differences might logically 
result in variation in training proficiency. 

On the basis of a long-standing interview study of 
trainee morale (1), a 20-item questionnaire was de- 
veloped, directed at probing attitudes which would 
reflect variations in morale within the definition used. 
The items utilized were based upon a thorough un- 
derstanding of the problem areas in the training pro- 
gram. Prior to the development of this final form, 
a detailed morale survey had been completed for the 
information of cognizant authorities (2). The 20- 
item form ultimately used represented a drawing out 
and refinement based on earlier analyses. There was 
every reason to expect that this questionnaire suited 
the purposes of this study by providing the most 
penetrating morale survey for the population in- 
volved. 

From previous work with the peer nomination 
technique, reported elsewhere (4), a special form 
was developed on which each cadet was asked to 
nominate, in order, the three men in his section 
whom he considered “highest” on “interest in and 
enthusiasm for naval aviation,” and the three men 
whom he considered “lowest” on this variable. 

In addition to these measures, each cadet was 
asked to rank himself on this interest and enthusiasm 
variable, in comparison with his section mates, by in- 
serting a number in a blank space. Since this rank- 
ing was obtained on the same form as the peer 
nominations, it was presumed to be based upon sub- 
stantially the same psychological set. 

All three of these devices were administered to
eight cadet sections graduating from a four-month
preflight curriculum at Pensacola in the winter of
1953-54. A total of 210 cases, with each section
composed of approximately 25 men, was thus ob- 
tained. The group estimate of a cadet’s morale was 
secured by weighting “high” nominations +3, +2,
+1, and “low” nominations -3, -2, -1, and then
algebraically summing these weights. Fortunately, 
the problem of “unnominated” cases was minimal as 
only six cases fell in this category. For each section, 
a rank-order score of cadets was then developed. 
These data and the self-rank data were then con- 
verted to rankits to permit comparisons across sec- 
tions. 
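
The scoring just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the ballot layout, the function names, the assignment of the largest weight to the first-named man, and the use of Blom's approximation in place of tabled rankits are all hypothetical choices made for the example.

```python
from statistics import NormalDist

HIGH_WEIGHTS = (3, 2, 1)     # first, second, third "highest" nomination (assumed order)
LOW_WEIGHTS = (-3, -2, -1)   # first, second, third "lowest" nomination (assumed order)


def nomination_scores(ballots, members):
    """Algebraically sum the weights each member receives over all ballots.

    Each ballot is a pair (high_names, low_names), both listed in order.
    Members nobody names keep a score of zero.
    """
    scores = {m: 0 for m in members}
    for high, low in ballots:
        for name, weight in zip(high, HIGH_WEIGHTS):
            scores[name] += weight
        for name, weight in zip(low, LOW_WEIGHTS):
            scores[name] += weight
    return scores


def approximate_rankits(scores):
    """Convert within-section scores to approximate normal scores.

    Ranks run from lowest to highest score, ties share the mean rank,
    and Blom's formula (r - 3/8) / (n + 1/4) stands in for tabled rankits.
    """
    n = len(scores)
    ordered = sorted(scores.values())
    mean_rank = {}
    for value in set(ordered):
        positions = [i + 1 for i, v in enumerate(ordered) if v == value]
        mean_rank[value] = sum(positions) / len(positions)
    nd = NormalDist()
    return {m: nd.inv_cdf((mean_rank[s] - 0.375) / (n + 0.25))
            for m, s in scores.items()}


# Toy section of seven cadets and two ballots (invented data).
members = list("abcdefg")
ballots = [(("a", "b", "c"), ("g", "f", "e")),
           (("a", "c", "b"), ("f", "g", "e"))]
scores = nomination_scores(ballots, members)
rankits = approximate_rankits(scores)
```

Because the normal scores are computed within each section, a cadet's standing can be compared across sections of slightly different size, which is the purpose the rankit conversion serves in the text.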

The morale inventory was empirically scored by 
taking the 50 men having the highest peer nomina- 
tion score and the 50 men having the lowest peer 
nomination score and determining the items of the 
morale survey which significantly discriminated be- 
tween these groups on the basis of tetrachoric cor- 
relations. Twelve items were found to yield such 
discrimination; the morale survey was then scored 
for all subjects, using these 12 items. This pro- 
cedure gave a score for the morale survey which was 
maximized in the sense that it might have the high- 
est possible correlation with the other variables. 

Estimates of the reliability of the morale survey 
and the peer nominations were determined. An odd-
even reliability for the survey form resulted in an
average of .55 for the eight sections; when corrected
by the Spearman-Brown formula, this became a re- 
liability estimate of .71. An analysis of variance 
estimate of the reliability of peer nominations yielded 
an r of .82. 
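
The Spearman-Brown step-up from the half-test correlation to the full-length estimate can be checked with a short sketch. The function name and the `k` parameter are illustrative; only the .55 and .71 figures come from the text.

```python
def spearman_brown(r_half, k=2.0):
    """Predicted reliability of a test lengthened k times.

    r_half is the correlation between the two halves (odd vs. even items);
    k = 2 restores the full-length form.
    """
    return k * r_half / (1.0 + (k - 1.0) * r_half)


full_length = spearman_brown(0.55)
print(round(full_length, 2))  # 0.71
```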


Results 


Table 1 provides the intercorrelations of 
the maximized morale inventory score, the 
self ratings, and the peer nominations (or 
group estimate) of morale. These intercor- 
relations were secured by first determining the 
intrasection correlations and then averaging 
these, taking into account the N of the groups 
involved. Although the r’s reported are sig- 
nificant, they are sufficiently low in magnitude 
to suggest that the measures used resulted in 
different estimates of a person’s “morale,” 
even when this concept of morale was based 
upon a common definition. One may now ask 
which of these appears to best predict a per- 
formance criterion of morale. 
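
One way to average intrasection correlations while "taking into account the N of the groups" is a simple N-weighted mean, sketched below. The text does not say which weighting was used, so the formula, the function name, and the sample figures are assumptions for illustration only.

```python
def weighted_mean_r(correlations, sizes):
    """Average within-section correlations, weighting each by its section N."""
    if len(correlations) != len(sizes):
        raise ValueError("one N is needed per correlation")
    total = sum(sizes)
    return sum(r * n for r, n in zip(correlations, sizes)) / total


# Invented values for two sections of unequal size.
pooled = weighted_mean_r([0.20, 0.40], [10, 30])
```

Computing the coefficients within sections before pooling keeps differences between section means from inflating or masking the relationships.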

In keeping with the definition set forth 
above, the simplest performance measure of 
morale which appeared appropriate was the 
pass-withdraw criterion, that is, whether the 
individual remained in training or voluntarily 
withdrew during a five-month period of flight 
training following preflight. During this period
more than 90% of the voluntary withdrawals
from flight training occur.

In Table 2 biserial correlations are pre- 
sented for these three measures against this 
pass-withdraw criterion; 16 of the original 
210 cases had withdrawn at this time. Be- 
cause of the low proportion of cases in the 
withdrawal group, the standard errors of these 
biserials are, of course, considerable. In the 
light of this unavoidable but vexing problem, 
an additional correlational estimate was ob- 
tained using Kendall’s tau. These nonpara- 
metric estimates are also presented in Table 
2. The actual extent of these correlations
with the criterion of withdrawal will, of
course, require further accumulation of cases
from succeeding samples. However, our major
concern with these coefficients was the
relative relationships that they revealed.


Table 1


Intercorrelation of Three Estimates of Morale


Variables

Self estimate vs. group estimate
Survey estimate vs. group estimate
Survey estimate vs. self estimate

** P < .01.


Table 2


Validity Coefficients Against a Pass-Withdraw Criterion
for Three Estimates of Morale


Predictor            r_bis With Criterion    Kendall’s Tau

Group estimate              .90                  .27
Self estimate               .83                  .22
Survey estimate             .30                  [illegible]
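
The biserial coefficient used against the pass-withdraw criterion can be sketched as follows. This is an illustration of the textbook formula, not the authors' computation: the function names, the use of the population standard deviation, and the toy scores are assumptions. Only the 16-of-210 split comes from the text.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def biserial(scores, passed):
    """Biserial r between a continuous score and a pass/withdraw split.

    r_bis = (M_pass - M_withdraw) / s * (p * q / y), where y is the normal
    ordinate at the point cutting off the proportion p.
    """
    stay = [s for s, ok in zip(scores, passed) if ok]
    left = [s for s, ok in zip(scores, passed) if not ok]
    p = len(stay) / len(scores)
    q = 1.0 - p
    nd = NormalDist()
    y = nd.pdf(nd.inv_cdf(p))
    return (mean(stay) - mean(left)) / pstdev(scores) * (p * q / y)


def inflation_over_point_biserial(q):
    """Factor sqrt(p*q) / y by which r_bis exceeds the point-biserial r."""
    p = 1.0 - q
    nd = NormalDist()
    return sqrt(p * q) / nd.pdf(nd.inv_cdf(p))


# Split reported in the text: 16 withdrawals among 210 cadets.
factor = inflation_over_point_biserial(16 / 210)

# Invented scores, only to exercise the function.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
toy_passed = [False, False, True, True, True, True, True, True]
toy_r = biserial(toy_scores, toy_passed)
```

With so uneven a split the multiplying factor is well above 1, so a few cases moving between groups can shift the coefficient appreciably. This is one reason the authors supplement the biserials with Kendall's tau.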

As striking as the validities shown in Table 
2 may be, a tabulation by quartiles of the 
withdrawal group, on the three measures ap- 
plied, is perhaps even more impressive. A 
distribution of the 16 withdrawal cases fall- 
ing in each quartile of the scores for each 
variable is given in Table 3. From a study 
of this table, it will be noted that 50% of 
the men who subsequently withdrew were 
among the lower 25% of the estimates of the 
total sample. Nine of the 16 withdrawals 
had been judged by their peers as being in 
the lower 25% on the “interest and enthusi- 
asm” variable. All of the estimates, when 
compared with the expected distribution of 
scores, show a tendency for withdrawals to 
have lower morale of the variety here defined 
than successful cadets. 
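
The comparison with the "expected distribution" can be made concrete with a short sketch using the Table 3 counts. The chi-square statistic below is an added illustration, not an analysis reported by the authors, and with only four expected cases per quartile it should be read as descriptive rather than as a formal test. The variable and function names are hypothetical.

```python
# Withdrawal counts per quartile, ordered Q4 (upper) to Q1 (lower), from Table 3.
WITHDRAWALS = {
    "self": [1, 1, 6, 8],
    "group": [1, 1, 5, 9],
    "survey": [1, 5, 5, 5],
}


def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts in every quartile."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def share_in_lowest_quartile(counts):
    """Proportion of withdrawals falling in Q1, the last entry."""
    return counts[-1] / sum(counts)


summary = {name: (chi_square_uniform(c), share_in_lowest_quartile(c))
           for name, c in WITHDRAWALS.items()}
```

If withdrawals were unrelated to a measure, about four of the 16 cases would fall in each quartile; the departure from that pattern is largest for the group estimate and smallest for the survey estimate.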

Table 3


Quartile Distributions on Three Estimates of
Morale for 16 Withdrawal Cases


                    Self        Group       Survey
                    Estimate    Estimate    Estimate

Q4 (upper 25%)         1           1           1
Q3                     1           1           5
Q2                     6           5           5
Q1 (lower 25%)         8           9           5


It may be noted in passing that six cadets
had been eliminated on grounds of “flight
failure,” i.e., being unable to learn to fly.
Since these cases would primarily represent
deficiencies in aptitude rather than motivation,
we would expect little relationship between
our measures and this condition. Such,
in fact, was the case: on self ratings 50%
placed the enthusiasm level as above the average
of the group, 50% below the average;
on group rating one-third of these cases were
placed in the upper 50% of the ratings; the
morale survey scores place two-thirds of these
men above average in their favorable responses.


Discussion and Conclusions 


The findings presented may be interpreted 
to mean, first, that if we had tapped morale 
by one method rather than another, consider- 
able variation in estimates of a given indi- 
vidual’s “morale” within his group would have 
been obtained. This is so, even within the 
framework of the common definition applied. 
It lends reinforcement to the consideration 
that the most rigorous specification of the 
term is required. The implicit problem now 
posed might be which of these measures is a 
“true index” of the individual's morale. 

In the introduction to this study it was 
suggested that the meaningfulness of a pre- 
dictor of morale might be appropriately tested 
by its relationship to the reality of a perform- 
ance criterion. It is evident from the second 
analysis that the measures obtained by peer 
nominations and self ratings do show a strong 
relationship to the pass-withdraw criterion 
derived from subsequent training. On the 
other hand, the survey estimate of morale 
bears only a relatively limited relationship to 
this criterion. 

The finding that the group estimate yields
a high validity estimate against this performance
criterion is not inordinately surprising
in light of other evidence regarding the pene- 
trating quality of pooled group judgments 
(3). It is totally reasonable to suppose that 
the members of a group, living together un- 
der intimate conditions for several months, 
might have a highly sophisticated “group un- 
derstanding” of member characteristics. In a 
military setting such as this, where “interest 
and enthusiasm” are significant in day-to-day 
activity, this sophistication might be height- 
ened all the more. 

Considering the survey estimate, its relationship
to the criterion is considerably lower
than that evidenced by the other two measures.
This finding may stem from two
sources: the limitation of this specific survey 
form or a limitation which may be inherent in 
the survey instrument approach to morale, 
where the prediction of a performance cri- 
terion is involved. It may only be asserted 
in response to the first point that the survey 
employed was a refinement of one that had 
already demonstrated its operational utility 
for the traditional purposes. There was substantial
reason to believe it totally adequate.

Without dispensing with the survey tech- 
nique, therefore, it would seem that credence 
has been lent to the view that for purposes of 
predicting performance criteria of morale, self 
ratings and peer nominations may have con- 
siderable usefulness. It is, of course, true 
that such indices cannot be derived from 
simply any group but must be employed only 
under applicable conditions. While survey 
instruments may continue to provide cues to 
administrative action where morale is con- 
cerned, nevertheless, there is merit in consid- 
ering the utilization of the peer nomination 
or self-rating techniques for the handling of 
prediction problems of the sort encountered 
here. 


Summary 


A study was completed to determine the
relative validity of three techniques of morale
assessment in predicting an operational
criterion in naval air training. Morale was
defined simply as an “interest in and enthusiasm
for naval aviation.” An attitude survey,
peer nomination form, and self rating were
used as the measuring instruments.

Eight sections of naval aviation cadets
graduating from preflight (N = 210) were
used as the study sample. Scores were derived
for each of these measures and intercorrelated
with one another. These coefficients
were found to be of relatively low
magnitude. As for their validities, it was
found that group estimates (r_bis = .90) and
self estimates (r_bis = .83) yielded the highest
relative relationships with a criterion of pass-
withdraw after five subsequent months of
training. The coefficient for survey estimate
was notably lower (r_bis = .30). Although the
standard errors of these coefficients are large
owing to an extreme split, their relative magnitude
is informative.

The conclusion was offered that peer nomi- 
nations and direct self descriptions may have 
greater utility in reflecting involvement in a 
training program than does the traditional 
“morale index” derived from a survey instru- 
ment. 


Received March 21, 1955.


Previous work, e.g., Davis (1) and Venables
(8), has indicated that features of
skilled performance are related to aspects of 
personality defined clinically or psychometri- 
cally. Davis found that performance in a 
simulated aircraft cockpit could be considered 
as lying on an overactive-inert continuum, 
and that this was in turn related to clinically 
rated anxiety and hysteria. Individuals who 
showed excessive overactivity or inertia were 
also found to be more neurotic than those 
showing a median type of performance. These 
findings have been confirmed and extended in 
work by the writer using a simpler type of 
apparatus, and both psychometric and clini- 
cal ratings of personality type. 

An opportunity was given to investigate 
similar relationships in a practical setting by 
further study of the subjects used by Lewis (6),
whose work was concerned with the relationship
between consistency in performance and the 
standard of driving skill of his subjects. He 
found that consistency was higher in groups 
of skilled drivers than in an unskilled group. 

The present paper is concerned with rela- 
tionships between these consistency measures 
and scores on two questionnaires that meas- 
ured “neuroticism” (emotional instability) 
and introversion-extraversion. The latter di- 
mension was measured by a version of Guil- 
ford’s R (rhathymia) scale which has been 
shown by Eysenck (2) to have a high load- 
ing on a second-order factor identified as in- 
troversion-extraversion and minimal loadings 
on a factor identified as neuroticism. Hilde- 
brand (4) found that the R scale distin- 
guished between anxiety states and hysterics 
in a neurotic group. 

1 The author wishes to acknowledge the help from
Inspector F. R. Priestley of the Essex County Police
Advanced Driving School. He is also indebted to
Dr. Alastair Heron, Mr. R. E. F. Lewis and Dr.
N. H. Mackworth for their cooperation.

It is to be expected that subjects (Ss)
whose performance deviates from a median
position on the overactive-inert continuum will
show a performance having features associated
with lack of skill and, therefore, following
Lewis’ findings of an association between
skill and consistency, will be most inconsist- 
ent in performance. In view of the cited find- 
ings of Davis and Venables, these Ss are likely 
to be those who occupy the extreme of the 
introversion-extraversion dimension, and who 
have higher neuroticism scores. Whether the 
less consistent drivers will tend to one ex- 
treme of the introversion-extraversion dimen- 
sion more than another is a matter for specu- 
lation. It would seem, however, that the poor 
driver will show a greater tendency to extra- 
version than introversion in view of the rela- 
tion of extraversion with tendency to inert 
performance under affect (cf. 1, 8). This
inertia would, it is thought, produce lax con- 
trol of the car which would show as incon- 
sistent behavior between occasions. 


Method 


Subjects. Three groups of Ss were tested. Group
1 consisted of 10 skilled police driving instructors or
motor-patrol drivers who had been trained at an advanced
motoring school. Group 2 consisted of 10
skilled car club drivers not police trained, and Group
3 contained six drivers of lower skill than Groups 1
or 2, all of whom, however, had driven for at least
three years. The selection of Ss was made by the
Chief Driving Instructor of the Essex Police Advanced
Driving School.

Tests. The neuroticism test was that of Heron
(3). This consists of a 74-item questionnaire of
which 40 items comprise the neuroticism or emotional
instability measure proper, the remainder being
made up of buffer items and items selected from
the “Lie Scale” of the MMPI. This questionnaire had
a misclassification rate of 6% when validated on
groups of hospitalized neurotics and normal industrial
subjects. The introversion-extraversion questionnaire
is an item-analyzed version of the Guilford R
(rhathymia) Scale modified for use in England.
Both tests are available either in a pencil-and-paper
version or in the form used by Heron, where the items
are presented singly on cards from a box. These cards are
then placed by S in boxes via slots marked TRUE
and NOT TRUE. The pencil-and-paper version was
used for Group 2, the questionnaires being sent out
by post with a covering note. All Ss completed the


Table 1


Rank Correlations of Driving Consistency Measures with Personality Scores


                                             “6-Week”         “Same-Day”
                                             Consistency      Consistency
Tests                        Group    n      Tau      P       Tau

Neuroticism                    1      10     -.46    .054     -.56
                               2      10     +.02    >.5      +.02
                               3       6     -.60    .068     -.80

Extraversion                   1      10     -.49    .045     -.17
                               2      10     -.36    .110     -.39
                               3       6     -.40    .180     -.33

Deviation from mean            1      10     -.49    .045     -.73
introversion-extraversion      2      10     -.08    >.5      -.11
                               3       6     -.66    .048     -.66


* Pairs of correlations whose difference is significant at less than .05 level.


forms and returned them. The tests were administered
personally to Groups 1 and 3 by the “slot-box”
device.

Driving performance measures.2 Two measures
were used in the analysis. These were for consist- 
ency of performance (a) between two periods on the 
same day, and (b) between two periods six weeks 
apart. They were based on measures of vehicle ac- 
celeration and deceleration. These were made by the 
photographic recording of a damped accelerometer 
every 5 yards over a distance of 150 yards before 
and after four corners of a test track on an airfield 
runway. 


Results 


Using Kendall’s (5) rank-correlation method 
for tied ranks, τb coefficients were calculated
for relationships between consistency scores
for “six-week” and “same-day” periods and
neuroticism or emotional instability, intro- 
version-extraversion, and a score represent- 
ing deviations regardless of sign from the 
mean introversion-extraversion score for each 
group. These rank correlation coefficients were 
calculated for each group separately. The 
resulting coefficients are shown in Table 1. 

Probabilities were calculated by the use of 
Kendall’s method for use with tied ranks, a 
correction being made for continuity. 
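
A rank correlation adjusted for ties, of the kind described above, can be sketched as follows. This is the standard tau-b definition written out pair by pair; the function name and the toy data are assumptions, and the significance test with continuity correction is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D count concordant and discordant pairs, n0 is the number of
    pairs, n1 the pairs tied on x, and n2 the pairs tied on y.
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                      # tied on both: drops out of both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    not_tied_on_x = untied + tied_y_only  # equals n0 - n1
    not_tied_on_y = untied + tied_x_only  # equals n0 - n2
    return (concordant - discordant) / sqrt(not_tied_on_x * not_tied_on_y)


# Invented ranks with one tie in each variable.
tau = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])
```

With groups of 10 and 6 subjects, tied scores are likely on short questionnaires, which is why a tie-adjusted coefficient suits these data.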

It is seen from Table 1 that in Groups 1 
and 3 there is a significant tendency for in- 
consistency over two periods on the same 
day, and over two periods six weeks apart to
be associated with neuroticism or emotional
instability, and deviation from mean introversion-
extraversion scores.

2 For full details see Lewis (6).

There is also a significant tendency in
Group 1 for driving inconsistency on the same
day to be associated more with the extraverted
than the introverted end of the dimension.
This tendency is found in Groups
2 and 3, although not at a probability level
that reaches the usual standards of acceptability.

The associations between driving perform- 
ance and neuroticism and deviation from mean 
introversion-extraversion, found in Groups 1 
and 3, are not repeated in Group 2. While 
only two differences in correlation between 
Groups 1 or 3 and Group 2 reach a signifi- 
cant level, the consistency of the appearance 
of these differences is a finding which war- 
rants further attention. 

At the present time it is only possible to 
make suggestions to account for this lack of 
association. The first to be considered is that 
Group 2, in answering the questionnaires by 
post, did not give such valid answers as those 
of Groups 1 and 3 who used the “slot-box” 
version under supervision. While this ex- 
planation cannot be dismissed, it seems un- 
likely in view of the wide range of neuroticism 
scores and also the low mean “lie” score obtained
by Group 2. The mean “lie” score for
this group is 1.88, and should be compared
with that of 2.50 for Group 1, where significant
associations between personality and performance
measures were obtained. If the validity
of the personality measures in Group 2
is accepted, we must seek other explanations 
for the lack of association between person- 
ality and performance measures in this group. 
The most likely explanation suggested by the 
present data is that exceptionally high mo- 
tivation overlaid any variance in performance 
due to neurotic or emotionally unstable per- 
sonality. Two slight pointers giving tenta- 
tive confirmation of this explanation are: (a) 
the nonsignificant but fairly sizable correla- 
tion of inconsistency with extraversion. In- 
troversion or anxiety is identified in some 
work, e.g., Taylor (7), with high drive in the 
Hullian sense, and its converse, extraversion, 
therefore might possibly have relation to low 
motivation. If this can be accepted, lack of 
motivation might bear some low relation to 
inconsistency. (b) The rank order of consistency
over the “same-day” and “six-week”
periods is very close in Group 2 (tau = .93,
P = .000072), while the corresponding figures 
for Groups 1 and 3 are .51 (P = .023) and 
.53 (P= .102). It may thus be that the 
competitive motivation between the members 
of Group 2, that shows up in their regular 
maintenance of ability order, overcame any
variance due to other factors. This was not 
so in the case of Groups 1 and 3 whose mem- 
bers although known to each other were not 
apparently motivated by the same team com- 
petitive spirit as Group 2. 

Examination of the range of personality 
measures in each group shows that the Skilled 
Police Group 1 is the most homogeneous. The 
range of neuroticism scores in this group is 
from 3-9, i.e., 7 points; and of introversion-
extraversion scores from 10-21, i.e., 12 points.
Similar ranges for Group 2 are 1-10, i.e., 10
points; and 12-28, i.e., 17 points. The figures
for Group 3 are 0-15, i.e., 16 points;
and 4-22, i.e., 19 points, respectively. It
was not thought that the usual methods of 
comparing size of variance could be applied 
in view of the small numbers involved in this 
study. If, however, the deviations from the 
combined mean neuroticism score of Groups 
1 and 3 are ranked and the ranking correlated
with the Group 1-3 dichotomy, using the
method of Whitfield (9), the resulting corre- 
lation coefficient is significant at the .01 
level, indicating greater spread of scores in 
Group 3. Similar treatment of the intro- 
version-extraversion score does not produce a 
significant result. Comparison of Groups 1 
and 2 does not show significant difference in 
range. What selection processes have been 
at work to produce the relative homogeneity 
of personality in skilled police drivers is not 
clear, but it is interesting that in spite of this 
homogeneity the effects of personality differ- 
ence are still felt. 


Conclusion 


These findings of relation between person- 
ality measures and consistency of driving per- 
formance are of some practical importance but 
need confirmation with large numbers. It 
would seem valuable in future work to deter- 
mine how far actual skill in driving rather 
than consistency of performance is related to 
personality measures. Of particular interest 
is the lack of significant correlation of per- 
sonality measures with driving consistency in 
Group 2. If the suggestion of higher motivation
in this group is accepted, it would indicate that,
in spite of psychological handicap, poor driving
performance can be overcome, at least to
some extent, by adequate enthusiasm.


Summary 


1. Two scores of driving consistency were 
found to be related negatively to neuroticism 
and to extremes of introversion-extraversion 
in two groups consisting of highly skilled
and lesser skilled police drivers.

2. These relationships were not found in a 
group of skilled motor-club drivers. 

3. There was in one case a significant tend- 
ency and in other cases suggestive tendencies 
for extraversion to be more related to inconsistency
than introversion.


Orientation


Published research on the interview over 
the past 40 years seems to have led to a 
growing distrust of it as an instrument for 
measuring personality. The particular investi- 
gations on which this distrust is based reveal 
great diversity in the data and results, as may 
be seen in Wagner's survey of the literature 
(17). They were, however, very diverse in 
their form and purpose, and it is therefore 
illogical to conclude that because the findings 
were sometimes apparently contradictory, an 
interview is an unreliable instrument. Cer- 
tain kinds of missile aimed by certain people 
at certain kinds of target often go wide; but 
it cannot, therefore, be inferred that any mis- 
sile aimed at any target will necessarily be in- 
accurate. “Interview,” like “missile,” is a 
general term, and it is only when the material 
obtained in diverse enquiries is categorized 
according to the different forms the interview 
took, and according to the different purposes
for which it was used, that order begins
to emerge. This paper, although restricted 
mainly to a consideration of the interview as 
an instrument for predicting occupational 
adaptation, will begin by examining the gen- 
eral problem. 

The experimental interview reported in the 
literature has varied a great deal in form. 
Sometimes it has been almost amorphous, 
with no predetermined order except that im- 
posed by its objective, viz., the assessment of 
abilities or traits. Recently there has been 
a tendency to introduce a “pattern” (11), or 
a standardized form (6). This tendency, 
commendable though it is in some respects, 
has led some investigators to lose sight of the 
essential character of the interview and to in- 
troduce features which belong more properly 
to psychological testing than to interviewing.
This is true of Moriwaki’s work (10), and
more particularly of Snedden’s (15).

1 Research carried out under the Medical Research
Council (Unit for Research in Occupational Adaptation),
London, England.

Perhaps the most satisfactory attempts to 
crystallize the form of the interview are those 
in which the interviewer uses a chart with a 
rating scale (e.g., 14). This leaves him free 
to vary his procedure according to the situa- 
tion, and yet demands that by the end of the 
interview he will have arrived at certain spe- 
cific evaluations. 

There is a tendency, exemplified by Wagner 
(17), to conclude from such work that the 
more definite the form of the interview, the 
more valid it will be. This is a premature 
conclusion, however, which does not allow for 
an important uncontrolled variable, viz., the 
aim of the interview. 

In some investigations the aim has hardly 
been defined at all, as in studies on predicting 
success in salesmen (4), and it is not surpris- 
ing that the reliability of the interview used 
for this purpose has been consistently poor. 
What is surprising is that such experimental 
work, built on a very insecure foundation, 
should be so frequently cited, for example, by 
Hollingworth (5) as evidence against the re- 
liability and validity of the interview. The 
only valid conclusion that could be drawn is 
that “when several interviewers with varying 
degrees of experience interview subjects by 
any sort of technique, to predict their sales 
ability, they will give widely differing predic- 
tions.” 

In other studies the aim of the interview 
has been strictly defined—the assessment of 
more or less specific traits. Wagner (17) has 
listed 96 such traits which he found in his 
survey of the literature. Difficulties still 
arise, however. The traits selected for as- 
sessment are often taken from different levels 
of complexity in the personality structure; 
some are relatively simple and easily defined 
(such as intelligence, neatness, cheerfulness, 
quickness, enunciation, frankness, and courtesy)
whereas others are much more complex
and hard to define (such as refinement, rea- 
sonableness, character, integrity, judgment, 
and motivation). Moreover these traits dif- 
fer widely in the methods by which they can 
be assessed. For some, fairly simple objec- 
tive tests suffice; their assessment does not 
require a dynamic situation such as the in- 
terview provides: these are the relatively 
static traits, such as intelligence and knowl- 
edge. For others, although their assessment 
does not depend essentially on the dynamic 
situation in the interview, the assessor has to 
use judgment; consequently his own experi- 
ence and emotional development play a part 
in appraising them: these are the traits that 
he infers chiefly from the subject’s history 
(e.g., social adjustment, resourcefulness, punc- 
tuality, originality, leadership, ability to get 
along with others). There is a third category
of traits, whose assessment depends essentially
on the dynamic situation of the interview;
these can be subdivided into traits
which become apparent immediately, such as
poise, manner, and neatness; and traits which
become apparent only after prolonged interpersonal
relationship, such as sincerity of
purpose, motivation, ability to present ideas,
emotional balance, friendliness, responsiveness,
courtesy, etc.

There is one further element of confusion— 
diversity in the criteria used for validating 
the predictive interview. In some cases these 
criteria have been specific, as in the valida- 
tion of intelligence assessments against IQ 
test results (9, 10). But in studies of the 
prediction of job success, for example, differ- 
ent criteria have been used: in one investiga- 
tion, the result of written examinations for 
the post (2); in others, work performance, 
though if this is judged by the length of time 
the individual remains on the job (6), so 
many uncontrolled variables are involved that 
its value as a criterion is very questionable. 

On the whole, then, experimental work so 
far done on the interview has been unsatis- 
factory. When the objective is vague and ill 
defined and no particular form has been given 
to the interview, as in the salesmen studies 
(4), the reliability can be shown to be very 
poor (5), and it follows that the validity, too,
would be very poor (7). Even when the objective
is a specific, relatively static trait,
such as intelligence, and the interview is 
amorphous, the validity is still poor (9). But 
when the form of the interview is made more 
definite, as in the work of Moriwaki (10), 
the validity is significant (.63). When still 
further definition is given to the form of the 
interview, as in the work of Snedden (15), 
the validity is as high as .82 and .96; but 
here the form of the interview has been so 
rigidly defined that little use is made of that 
subtle dynamic play of interpersonal stimuli 
which gives an interview its essential char- 
acter. 

In some investigations the interview has 
been used to assess more dynamic character
traits—with varying success. Corey (1) found 
its validity low in respect of some poorly se- 
lected traits such as conceit, humor, and re- 
finement. But Viteles (16), who set out to 
assess some 22 traits, rated by judges, ob- 
tained validities of .58 and .46. In McMur- 
ray’s investigation (11), where the objectives 
were fairly specific and where some form was 
given to the interview, the validity of the in- 
terview was high against the criterion of fore- 
men’s evaluation (.68 in a series of 578, and 
.61 in a series of 84). McMurray used what 
he called a “patterned” interview, and took 
account of historical data which do not con- 
tribute to the essential character of the in- 
terview. 

In Rundquist’s investigation (13), the in- 
terviewers did not have access to records of 
any kind regarding the 1,359 subjects inter- 
viewed; they were asked to assess certain 
characteristics according to the subject’s re- 
actions in the social situation of the inter- 
view and not according to the content of his 
remarks. The assessments made in this in- 
vestigation gave a reliability of .87 and a va- 
lidity of .37 against the criterion of opinions 
expressed by the subjects’ intimate associates. 

The dynamic characteristics assessed in 
the work reported by Freeman (3) might be 
expected to emerge effectively in the inter- 
view. They were emotional stability, domi- 
nance, physical poise, resourcefulness, speed 
of adjustment, and egocentrism. In spite of 
the complex nature of some of these, correlation
coefficients from .33 to .74 were reported.

In the light of the foregoing review, cer- 
tain guiding principles are suggested as neces- 
sary in experimental work on the interview. 
The first is that the objective must be spe- 
cific. The traits or characteristics selected 
for assessment should be well defined and 
should be restricted to those which are not 
adequately measured by objective tests and 
which seem to depend on the dynamic “field” 
(in Lewin’s sense, 8) of the interview situa- 
tion. Secondly, the interview should not be 
formless; yet if it becomes very restricted in 
form, it loses the essential character of the 
interview; insofar as it takes on the nature 
of an impersonal psychological test, it is unable
to use a spontaneous, dynamic interaction
between two people as a means of revealing
certain complex personality features. The
sequence of the interview needs to be free, to 
allow for much spontaneity on the part of the 
interviewee, but the schematic interview chart 
should be completed after the interview, as 
recommended by Schmeltzer and Adams (14). 
Thirdly, the criterion against which the inter- 
view is to be measured must itself be reliable. 
Fourthly, though from the practical stand- 
point the employment interview must take the 
subject’s record into consideration, it is im- 
portant for experimental purposes to keep 
historical data at a minimum, in order that 
the essential character of the interview may 
be given full play and its value studied. 


Pilot Study 


An experiment was designed to explore and 
to test the validity of the essential character 
of the interview. 


Design 


The form and objectives of the interview. 
An interview chart was designed, to be used 
at a pre-employment interview for rating sub- 
jects’ attitudes on a three-point scale. The 
purpose of the chart was threefold: (a) to 
give form to the interview, (b) to exclude 
the use of any psychological test of ability, 
or biographical record, (c) to facilitate sta- 
tistical analysis of the findings. 

The attitudes of the subjects were considered
in six spheres (Table 1). There was no
restriction on the sequence of the interview 
and the matters discussed did not need to go 
beyond the work experience, and general in- 
terests of the subject. 


Table 1 


Interview Outline 








A. Formulation of Goal
  1. Clarity of preference for type of work.
  2. Adequacy of reasons for this choice.
  3. Entertaining objectives beyond the job itself:
     advancement in position.
  4. Entertaining objectives beyond the job itself:
     acquisitive plans.
  5. Consistency of alternatives entertained.
B. Strength of Job Interest (past or present)
  1. Understanding of the procedure in his own job.
  2. Understanding of the procedure in preceding and
     succeeding processes.
  3. Interest in the application of the finished product.
  4. Interest in the firm within the industry.
  5. Interest in the industry within society.
C. Strength of General Interests
  1. Active pleasures: sports.
  2. Constructive hobbies or accomplishments.
  3. Acquisitive hobbies or interests.
  4. Group activities: clubs.
  5. Assuming responsibilities and leadership in group
     activities.
D. Self-Regard
  1. Satisfaction in regard to past achievements:
     school, past work, social.
  2. Satisfaction in regard to present achievements:
     work, social.
  3. Concern over reputation for punctuality and
     regularity at work.
  4. Concern over enforced idleness.
  5. Any altruistic motivation bearing upon work.
E. Acquisitive Perseverance
  1. Persistence in holding previous jobs.
  2. Adequacy in purpose in previous job changes.
  3. Savings or acquisitions of property.
  4. Readiness to undertake distasteful job for
     acquisitive purposes.
F. Nervous Tension
  1. Tension obvious during interview.
  2. Getting flustered when work scrutinized.
  3. Preoccupation with minor symptoms.
  4. Complaints of ready fatigability.
  5. History of frequent, nondescript ill health
     (hypochondriacal).

The chart could not be filled in from bare
biographical statements. For example, in the 
sphere of “goal formation,” the answer to the 
enquiry about the type of work preferred 
might be “electrical mechanic.” The inter- 
viewer would try to assess the strength of this 
preference, by observing how the subject put 
it, by appraising the reasons given for the 
preference, and by noting the consistency of 
alternative preferences should the first choice 
be unattainable. 

The material indicating “job interest” could 
hardly be obtained otherwise than by inter- 
view. 

In the sphere of “general interests,” biographical
statements were supplemented by
the interviewer’s assessment of the degree of 
interest shown in each item. 

In the sphere of “self regard” the interviewer
tried to judge how satisfied or dissatisfied
the subject now was with his past
achievement. During the experiment it was 
found necessary to change the items grouped 
under this head, as the information originally 
required proved too elusive. 

The items under “acquisitive perseverance” 
are more directly concerned with biographical 
fact and make little demand on the dynamic 
situation in the interview. This is also largely 
true in the sphere of “nervous tension,” ex- 
cept for Subsection 1. 

In totaling the score, all scores under Sec- 
tions A to E were given a positive sign; those 
under Section F were given a negative sign 
and they were deducted from the total. 
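
The scoring rule just described can be sketched in a few
lines. This is only an illustration under stated assumptions:
the text does not give the numeric values of the three-point
scale, so a 0/1/2 coding is assumed here, and the function
name, the `tension_weight` parameter, and the sample ratings
are all hypothetical rather than taken from the study.

```python
# Sections whose item ratings count positively toward the total.
POSITIVE_SECTIONS = ("A", "B", "C", "D", "E")


def total_interview_score(ratings, tension_weight=1.0):
    """Total score: Sections A-E added, Section F deducted.

    `ratings` maps a section letter to a list of item ratings.
    The 0/1/2 coding of the three-point scale is an assumption.
    `tension_weight` scales the Section F deduction.
    """
    positive = sum(sum(ratings.get(s, [])) for s in POSITIVE_SECTIONS)
    deduction = tension_weight * sum(ratings.get("F", []))
    return positive - deduction


# Hypothetical ratings for one subject (not data from the study).
example = {
    "A": [2, 1, 1, 0, 2],
    "B": [1, 1, 0, 0, 0],
    "C": [2, 2, 1, 1, 0],
    "D": [1, 2, 2, 1, 0],
    "E": [1, 1, 0, 1],
    "F": [1, 0, 0, 0, 0],
}
print(total_interview_score(example))
```

Passing `tension_weight=0.5` would correspond to deducting
only half of the Section F score, should a lighter penalty
for that section be wanted.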


Subjects. The subjects (Ss) were 46 employees of 
a pharmaceutical manufacturing firm,2 engaged in
five comparable types of routine manual work. The 
majority were drawn from the same social and eco- 
nomic strata. All had left school at the minimum 
leaving age and with a few exceptions had attained 
less than average standard. Ages ranged from 16 to 
21, the majority being under 18 years of age. 

Validation criteria. Criteria consisted of the
ratings made independently by four persons on
different levels of authority, namely, the training
officer, the women’s personnel officer, the
supervisor, and the charge hand. They were asked to
base their ratings on (a) productivity (speed and
accuracy), and (b) job relations. The rating was by
subjective judgment, but to make it more reliable
the four independent ratings were combined to
produce a single composite rating (to be referred
to subsequently as “supervisors’ rating”).

2 I am indebted to Burroughs Wellcome & Co.
(The Wellcome Foundation Ltd.) for permission to
carry out the investigation at the Wellcome Chemical
Works, and in particular to Mr. Mendelson and
other members of the Personnel Department, without
whose active cooperation and help the inquiry
would not have been possible; and to Professor
Aubrey Lewis for his ready encouragement and
guidance.

The Ss were chosen from five separate “belts,” or 
sections, in each of which all members were engaged 
on roughly the same work. The assessors were asked 
to select from each of these sections those employees 
representing, in their judgment, the best 20%, the 
worst 20%, and an equal number representing the 
average worker. In this way it was hoped to mini- 
mize the number of Ss about whom there was some 
doubt in the mind of the assessors. Thus three 
categories emerged, (a) Best, (b) Average, and (c) 
Worst. Those in categories (a) and (c) were then 
ranked by the assessors. From the assessments in 
this form it was possible to arrive at an over-all 
score for each individual, ranging from the one 
ranked most highly by the majority of the assessors 
to the one ranked lowest by the majority. 

Procedure. The Ss were presented to the investi- 
gator without his knowing anything of the assessors’ 
rating, or of the S’s past history. The interviews 
lasted from 15 to 20 minutes each. No set sequence 
was followed, but by the end of the interview the 
investigator had to be satisfied that he could rate 
each S on the three-point scale on each of the sub- 
sections of the interview chart. The scoring was 
done after each interview had ended. In addition to 
scoring the chart, the interviewer kept brief notes of 
such items as the school grade claimed to have been 
reached, any particular ambition mentioned, and 
whether S was more or less voluble than average. 


Results and Discussion 


An over-all comparison was made by rank 
correlation between the interview ratings, rep- 
resenting the prediction of job suitability, and 
the “supervisors’ ratings,” representing the 
current performance of subjects, for each of
the five groups (Table 2). Since the subsections
in Section D were changed after the
third group had been interviewed, only Groups
4 and 5 are rated on the entire interview
chart. For Groups 1, 2, and 3, Section D
was omitted from the calculations.


Table 2


Correlations Between Supervisors’ Ratings and
Interviewer’s Ratings

[Table body not legible in the source; only the
row heading "Group" survives.]

N.S. = Not significant at the 10% level.
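
The rank-correlation comparison described above can be
sketched as follows. The text does not say which rank
coefficient was used; Spearman's rho is assumed here, computed
as the product-moment correlation of average ranks so that
tied ratings are handled. The function names and the sample
scores are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical interview totals and supervisors' over-all ranks.
interview = [22, 15, 30, 9, 18, 25]
supervisors = [3, 2, 6, 1, 5, 4]
print(round(spearman_rho(interview, supervisors), 3))
```

With groups as small as those in this study, a single
discordant pair of ranks moves the coefficient considerably,
which is worth keeping in mind when comparing groups.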

In three of the five groups the correlations 
were quite significant, suggesting that at this 
brief interview, with no other information 
about the subjects, it was possible to make 
rapidly a fairly accurate prediction, in terms 
of the supervisors’ assessment of suitability. 

The cases in which prediction corresponded 
badly with performance were examined fur- 
ther, attention being confined to Groups 4 and 
5—a total of 26 Ss—for whom the revised 
interview structure had been used. It was 
found that six of these Ss, for whom the di- 
vergence was marked, accounted for nearly all 
the variance. When these were examined in
the light of additional notes taken at the in- 
terview, it was revealed that Ss noted as par- 
ticularly inhibited in their responses tended 
to be underestimated and those particularly 
voluble tended to be overestimated. This 
stresses the importance of allowing for the 
influence of the S’s verbal facility on the in- 
terviewer’s assessment. 

Another factor accounting for discrepancy 
was the direction in which the S’s ambition 
lay. The Ss who expressed the desire to be 
doing work of quite a different nature from 
their present job tended to be overestimated 
by the interviewer as to their suitability for 
this particular job. If the interviewer is re- 
quired to assess suitability for the particular 
type of work, then the preference for that 
work must be taken into account. 

The interview, as used, was not intended to 
assess intelligence. When intelligence, judged 
on the basis of school achievement, is left out, 
there is a tendency to underestimate the job 
suitability of the subjects of higher intelli- 
gence. Intelligence as indicated by school 
achievement is a useful indicator in the over- 
all assessment. The present study was not 
concerned with total assessment, which should, 
of course, include an evaluation of school 
achievement, but with the assessment of 
qualities brought out essentially by the inter- 
view. 


Table 3


Correlations Between Supervisors’ Ratings and Those
of the Separate Sections of the Interview

[Table body only partly legible in the source.
Column groups: Group 4; Group 5 (N = 12).
Surviving fragments, in printed order: "A 14";
"B -.13 N.S."; "42 N.S."; "D 67 01";
"-.13 N.S."; "- 30".]

* N.S. = Not significant at the 10% level.


To ascertain if any of the six sections of 
the interview chart were specially significant, 
each section was correlated separately with 
the “supervisors’ ratings.” As Section D had 
been altered after the first three groups were 
interviewed, the correlations were worked out 
only for Groups 4 and 5 (Table 3). From 
this table it would appear that only Sections 
A, C, and D were significant, and that Sec- 
tion E might even be lessening the predictive 
value of the interview. It should follow that 
a briefer interview, based on Sections A, C, 
and D only, would give as accurate a predic- 
tion. Table 4 rather bears this out. 

The very low, even negative (though not 
statistically significant), correlations for Sec- 
tion E do not, of course, necessarily mean 
that “acquisitive perseverance” does not con- 
tribute to job suitability. It may well be 
that the interview did not permit a valid as- 
sessment of the qualities in question. For 
the particular type of S used in this investigation
(young employees who had recently left school),
three of the four items in this section were
hardly applicable. No significance should be
attached at this stage, therefore, to the
correlations of Section E.


Table 4


Correlations Between Supervisors’ Ratings and Those
of Sections A, C, and D of the Interview

                    Sections A, C, D
Group      N          r        p
4          14        .54      .05
5          12        .66      .01

[Only one pair of r and p columns is legible in
the source, shown here under the heading it
follows; the entries of the other column group,
"Total Interview," are not recoverable.]

* N.S. = Not significant at the 10% level.

It was an unexpected finding that Section 
B—“strength of job interest”—did not ap- 
pear to be significant. On reconsidering the 
responses of the subjects on this section, how- 
ever, it seemed that in these young female em- 
ployees the level of job interest was generally 
so low that no adequate differential assess- 
ment could emerge. 

The number of cases in which responses 
could be scored under Section F (“nervous
tension”) was so small that this section had
to be considered almost superfluous for this 
particular group of employees. In calculat- 
ing the total interview rating, any deductions 
to be made under this section were halved. 

Correlations between the interview rating 
and the ratings of each of the four assessors 
(Table 5) showed a successive fall from that 
of the training officer to that of the charge 
hand (Group 5 was excluded from this part 
of the investigation because the situation at 
the factory at that time did not allow for 
adequate assessment by the training officer). 
This gradation could be interpreted in differ- 
ent ways. In assessing female employees, it 
could be assumed that the training officer 
would make the most comprehensive assess- 
ment of those who had passed through the 
Training Department, the women’s personnel 
officer the next most comprehensive assess- 
ment, and, least of all, the charge hand. This 
would mean that the interview, to be most 
valid, should correlate most highly with the 
training officer’s assessment, as it does in this 
instance. But it could on the other hand be
argued that the more closely the assessor
works with the individual concerned, the more
accurate is his assessment. In that case the
interview rating should correlate most highly
with the charge hand’s rating. The opinion
of the management, however, is that the
charge hand’s rating is the least accurate, as
it is most likely to be biased by prejudices,
based on superficial impressions, and
insufficiently comprehensive.


Table 5


Correlation Coefficients of Interview Ratings with
Ratings of Assessors Individually

   .92    .50    .60    .43
   .94    .94    .70    .30
   .58    .66    .66    .40
   .88    .71    .50    .10

[Column headings are only partly legible in the
source ("Personnel," "Welfare Officer,"
"Supervisor," "Charge Hand"); row labels are not
legible.]

The difficulty is illustrated by an individual 
who had been placed in category “A” by the 
women’s personnel officer, the supervisor, and 
the charge hand, whereas neither the inter- 
viewer nor the training officer had rated her 
in the “A” category. It emerged that she was 
on particularly good terms with her charge 
hand, who rated her highly. The charge 
hand’s reports in turn influenced the super- 
visor and the women’s personnel officer, who 
realized later, however, that while this em- 
ployee did well while working under that par- 
ticular charge hand, she was not satisfactory 
in her job relations generally. It was agreed 
that in this case the interview rating, coin- 
ciding with the training officer’s rating, had 
been more accurate. There is not enough 
evidence to show whose ratings are most ac- 
curate; for practical purposes, however, the 
value of the interview in assessing employees 
will be judged principally by the personnel 
management rather than by the lower levels 
of supervisory staff. 


Summary 


1. Research on the value of the interview 
for assessing personality and occupational fit- 
ness is reviewed. It is concluded that the 
most valuable interview is that which uses a 
standardized form, designed to assess com- 
plex, dynamic constellations of traits rather 
than relatively isolated, static traits. 

2. An investigation is reported, in which a 
pre-employment interview chart was used, 
based on six broad “attitudes.” The scores 
obtained on this chart by 46 workers were 
validated against over-all ratings based on 
separate ratings by four supervisors. Validi- 
ties ranged from .48 to .99. 

3. Relatively few subjects accounted for 
most of the variance. This is discussed with 
particular reference to the underestimation of
Ss who were inhibited in their responses, to 
the overestimation of Ss who were voluble or 
who expressed ambitions to do different work, 
and the neglect of the factor of school achieve- 
ment. 

4. The six “attitudes” were correlated sepa- 
rately with the supervisors’ ratings; those con- 
cerned with “goal formation,” “general inter- 
ests,” and “self regard” appeared to be sig- 
nificant. 

5. The interview ratings were correlated 
with the ratings of each of the four super- 
visors and the results are discussed. 

6. It is concluded that, properly used, the 
interview can play a reliable part in the over- 
all assessment of an individual’s qualities. 


Received February 17, 1955. 
An Examination of Students’ Attitudes Toward Television
as a Medium of Instruction in a Psychology Course 


Richard I. Evans 


University of Houston 


In recent papers by Evans et al. (2), Evans 
(1, 3), Husband (5), McKeachie (6), and 
Stromberg (7), several findings relative to the 
effectiveness of television as a medium of for- 
mal course instruction for psychology courses 
were reported. However, aside from the 
effectiveness of television instruction, other 
problems involved in television instruction 
are suggested. For example, implicit in the 
practical problem that a university faces in 
actually using television as a teaching me- 
dium, both from the standpoint of the broader 
aim of “adult education” and the more spe- 
cific aim of saving plant space in the face of 
rising enrollments, is the attitudes of the stu- 
dents who are involved in this large-scale “ex- 
periment” toward television instruction. 

In an earlier study by Evans (4), it was 
reported that before the University’s educa- 
tional television station went on the air, 54% 
of a randomly selected sample of students 
stated a preference for a course taught at 
least partly by television over the same course 
taught entirely in the traditional classroom 
manner, if they had their choice of these two 
methods of presentation at an equally con- 
venient time. The remaining 46% stated a 
preference for the traditional classroom in- 
struction. 

The present paper deals with an investiga- 
tion of students’ attitudes following the com- 
pletion of a television course. Two problems 
will be dealt with: (a) To what extent will 
students who complete a psychology course, 
lectures of which are offered via television, be 
sufficiently satisfied with television instruction
to be interested in taking another college
course in which television instruction is
involved; (b) what is the importance attached
by such students to on-campus class
discussions with the instructor as an adjunct 
to the television lectures. 


Method 


Subjects. To investigate the attitudes of students 
toward television instruction, 74 Ss enrolled in an 
evening section of elementary psychology were used.
During the registration period, there was no an- 
nouncement of the fact that the course would con- 
sist of two weekly 45-minute lectures on television 
plus two weekly 45-minute discussion meetings with 
the instructor on the campus. This control, not 
mentioning the proposed use of television, was
adopted in order to include Ss who would have
avoided enrolling in this section of the course if 
they knew ahead of time that the course was to be 
taught partly by television, or for that matter was 
anything other than the traditionally conducted
elementary psychology course that was regularly offered.

Upon being told of the instruction method to be 
used, some comments from the class reflected dis- 
satisfaction, but a comprehensive discussion of the 
problem by the instructor led to an acceptance of 
the instruction plan on at least a trial basis. In this 
respect it is interesting to note that the number of 
“drops” ultimately recorded in the course was no 
greater than is normally expected in a traditionally 
taught evening section, and in no instance was tele- 
vision instruction cited as a reason for dropping the 
course. 

Procedure. On the evening of the final examina- 
tion for the course, prior to being handed the ex- 
amination, Ss were instructed to answer the follow- 
ing two questions with the responses “Yes,” “No,” 
or “Undecided”: (a) Do you feel that you would 
enroll for another college course that used television 
as an instruction medium? (b) Do you feel that
the on-campus discussion sessions with the instruc- 
tor as an addition to the television lectures added 
anything appreciable to your learning in the course? 

Also provided on the answer forms was space to
answer the question “Why” following each question,
and space for “General comments” concerning the Ss’
feelings regarding the course in general.


Results and Discussion 


With respect to the first question concern- 
ing their willingness to take another course 
which utilized television instruction (74 Ss), 
70% responded “Yes” (favorable), 13% 
“No” (unfavorable), and 16% “Undecided.” 
The answers to the “Why” question for the 
Ss answering “Yes” reflected primarily the
following ideas: (a) For large lecture-type
classes, television is an ideal way to maintain
uninterrupted lecture continuity; (b)
the organization and quality of a television 
lecture is superior to a regular classroom lec- 
ture; (c) not all kinds of courses would prob- 
ably lend themselves to television lecturing, 
but the respondents would enroll for courses 
that did lend themselves to television lectur- 
ing (e.g., large-enrollment courses of the ele- 
mentary type). 

The answers to the “Why” question for the 
Ss who answered “No” or “Undecided” re- 
flected essentially these ideas: (a) Television 
instruction does not allow the opportunity for 
questions by students directed to the instructor
during the course of lectures; (b) technical
production, transmission, or reception
difficulties interfere with comprehension of 
lecture material at times; (c) interruptions 
from other viewers of the lecture or from 
other sources, at times, may interfere with 
comprehension of lecture material. 

The responses from 71 Ss (3 Ss failed to
respond) to the second question, which
concerned how valuable the Ss felt the
on-campus discussion sessions with the
instructor to be, were tallied as “Yes”
(favorable) in 42% of the instances, “No”
(unfavorable) in 56%, and “Undecided” in
only 2%. Here
the answers to the “Why” questions in the 
“Yes” group may be summarized as follows: 
(a) Made it possible to clear up things that 
weren't clear in the lecture; (b) personal contact
with the instructor increases interest and
learning. 

The answers to the “Why” questions of the 
“No” group, that is, the group that felt that 
the on-campus discussion as a supplement
didn’t add anything appreciable to their learn- 
ing in the course, may be summarized as fol- 
lows: (a) The lectures themselves were com- 
plete enough to deal with the subject matter; 
(b) the size of the discussion group was not 
small enough to be effective. 

In general, these findings suggest that: (a) 
The experience of actually taking a course in
which instruction is at least partly presented
by television is sufficiently satisfactory to most 
students to encourage them to take another
course involving television instruction; (b) a
considerable number of such students, however,
feel that some contact with the instructor
in the form of class discussions following
the television lecture has some value from
the standpoint of answering questions and 
face-to-face contact with the instructor in 
general; (c) an approximately equal group of 
students find the television lecture by itself, 
without the class discussion, sufficient for 
their learning needs. 

With reference to b above, the need to have 
questions answered concerning lecture mate- 
rial may be met by a question-answer session 
conducted on television using questions sub- 
mitted by viewers. An evaluation of this 
technique is described in an earlier paper by 
Evans (3). 

However, the over-all impression gained 
from the “open-end” question data, that 
face-to-face contact with the instructor is de- 
sired by many students, suggests that courses 
in which enrollments may be held to small 
numbers probably would be less satisfying to 
students if taught by television. On the other 
hand, in the high enrollment lecture courses, 
often characteristic of first and second year 
courses in many large universities (typified 
by the course dealt with in the present pa- 
per), this same “face-to-face” contact and 
discussion with the instructor is limited to al- 
most the same degree that it is in television- 
instructed courses. Consequently, television 
instruction could be conceived of as being no
less satisfying to the student. In fact, in the
large lecture courses, public address
systems are often used to compensate
for the actual physical distance of the in- 
structor from many of the students. Here 
television presentations, with their admittedly 
greater intimacy, may be highly desirable to 
students who particularly value a feeling of 
greater proximity to the instructor. 

As is so often the case, however, in ex- 
ploratory research such as is presented in the 
present paper, the results are probably more 
provocative than conclusive. For one thing, 
the data in the present paper are based on a 
particular course situation, and so responses 
to other courses taught by television could be 
quite different. However, the earlier studies 
reported above which demonstrate that television
instruction is probably as effective as
traditional classroom instruction from the 
standpoint of achievement of students, do 
not consider the problem of attitudes toward 
television instruction as factors in achieve- 
ment. It is possible that favorable or unfa- 
vorable attitudes toward instruction by tele- 
vision as manifested by the responses of Ss 
in the present study may be vital factors in 
achievement in such courses. If this proved 
to be the case, making television-instructed 
courses optional rather than compulsory 
would be a wise administrative procedure. 
Experiences of the writer and other tele- 
vision instructors at the University of Hous- 
ton would tend to support this position. A 
systematic study of this problem, of course, 
would provide the basis for further research 
which the writer plans to complete at a later 
date. 

In conclusion, it may be stated that aside 
from practical considerations such as the 
need for more classroom space, it would ap- 
pear that students enrolled in television-in- 
structed courses may possess favorable enough 
attitudes toward television instruction as a 
result of experiencing such instruction that an 
important future for television as a formal 
course instruction medium is suggested. 


Summary 


The attitudes toward television instruction
of Ss enrolled in an elementary psychology
course taught partly by television and
partly through on-campus discussion groups
were determined. Seventy per cent of the
Ss revealed an interest in taking another 
course involving television instruction, 13% 
would not do so, and 16% were undecided. 
Among the “open-end” question responses, a 
feeling was revealed that television lectures 
were better developed than traditional classroom
lectures, involved fewer disturbing interruptions
from students, but lacked opportunities
for class participation and were
subject to technical electronic interruptions. 

Responses concerning the supplementary class
discussions favored them in 42% of the cases
and judged them of no appreciable importance
to learning, in addition to television
lectures, in 56% of the cases. Two per cent
of the Ss were “Undecided” concerning their
value. Among these
“open-end” question responses it was re- 
flected that their importance centered around 
advantages gained through some face-to-face 
contact with the instructor for those favoring 
them, and the feeling that television lectures 
were sufficiently complete in themselves by 
those who found no need for them. 
Most validity studies of the Flesch reading
ease formula (3) have used readership as the 
criterion measure (4) although the equation 
was designed to evaluate comprehensibility 
or the difficulty level of the material read. 
Flesch (2) did employ an estimate of com- 
prehension but compared only two levels 
within his scale. Consequently this study 
was proposed to test the validity of the equa- 
tion in its present form (3) against a test of 
reading comprehension using “popular” mate- 
rial. Comprehension was defined as the abil- 
ity to answer correctly multiple-choice ques- 
tions on a 100-word passage. 
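
For readers who want the equation itself, the sketch below
computes a reading ease (RE) score from raw counts. It assumes
that reference (3) is Flesch's 1948 formula, RE = 206.835
- .846 wl - 1.015 sl, where wl is the number of syllables per
100 words and sl is the average sentence length in words. The
function name and the sample counts are hypothetical, and
syllable counting is left to the user.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease from counts of words, sentences, syllables.

    Assumes the 1948 form of the equation:
    RE = 206.835 - 0.846 * wl - 1.015 * sl
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    syllables_per_100_words = 100.0 * syllables / words  # wl
    words_per_sentence = words / sentences               # sl
    return 206.835 - 0.846 * syllables_per_100_words - 1.015 * words_per_sentence


# Hypothetical counts for one 100-word passage.
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
```

Higher scores indicate easier material, so a passage is moved
toward a lower RE level by lengthening its sentences or by
using words of more syllables.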


Method 


In the establishment of a reading test it was 
necessary to take into consideration interest value 
and Flesch counts. Samples were drawn from the 
“Life in These United States” section of the 12 
issues of the 1951 Reader's Digest. This section was 
chosen with the hope that the anecdotes would be 
interesting to high school seniors and American 
adults, generally. 

The articles were rewritten, if necessary, to have 
five 100-word articles at each of 10 points of diffi- 
culty, ranging at 10-point intervals between 5 and 
95. The samples were presented in a random order 
to 10 graduate students at the University of Denver 
who judged them for interest in terms of a graphic 
rating scale extending from “among the most inter- 
esting anecdotes I’ve read” (0) to “of no particular 
interest” (4). 

From each reading-ease (RE) level, one article 
was chosen with a mean interest rating between 1.4 
and 1.7: “more interesting than most anecdotes but 
not as much as some.” Six multiple-choice items 
with five possible alternatives were written for every 
one of the 10 passages selected. Although difficulty 
of the questions was not objectively analyzed since 
the formula does not yield estimates of phrases or 
single words, the attempt was made to use “easy” 
and equally difficult wordings. 


1 This article is based on a thesis presented in par- 
tial fulfillment for the degree of Master of Arts at 
the University of Denver under the name of Mar- 
garet Jean Stewart. The author is indebted to Dr. 
Alfred B. Shaklee for constructive criticism and help- 
ful guidance. 


The tests were administered to 101 high school 
students in the first half of their senior year. Sub- 
jects were tested in three groups of 25 and one of 
26 students during their “administrative” periods. 
The high school chosen, North, was classified by the 
Board of Education as Denver’s most heterogeneous.
The range between the first and the third quartiles
of IQ’s obtained from the Henmon-Nelson test of
intelligence (courtesy of North High School) was
92.5-114.5; the ages of the sample, 17 years, 3
months to 17 years, 7 months.

Before having the subjects read the 10 selected
paragraphs, the Survey Section of the Diagnostic
Reading Tests (DRT, 1) was administered, using
standard time limits (15 minutes each), in order to
be able to compare the Comprehension Test developed
for this study with a standardized test. The
General Reading and Comprehension subtests of the
Survey Section: Upper Level (Grades 7 through
college freshmen), Form A were used. In the
standardization, scores on General Reading and
Comprehension were added to receive a single score of
reading comprehension. The Vocabulary subtest was
not administered.

In the administration of the Comprehension Test
constructed for this experiment, slips with the first
anecdote (RE = 95) were distributed face down. On
signal from the experimenter, the subjects turned
them over and were allowed to read for one minute.
The slips were then turned face down and collected.
Questions for each paragraph were distributed and
collected in the same manner, allowing one minute
for answering the questions. The anecdotes were
administered in descending order from the easiest
(RE = 95) to the hardest (RE = 5).


Results and Discussion 


Table 1 presents the comparisons of the 
mean number of questions answered correctly 
for adjacent levels of reading ease scores. 
The significance of these differences was 
tested, using the t test for pair differences. 
Six of the nine differences were significant 
beyond the 5% level of confidence; three did 
not reach statistical significance. 
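
The t test for pair differences named above can be sketched as follows. This is a minimal illustration of the standard paired-difference statistic, not the author's own computation; the function name `paired_t` and the scores are hypothetical, since the individual subject scores are not given in the article.

```python
import math


def paired_t(first, second):
    """Return t for the mean of the paired differences (first minus second)."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Unbiased variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / math.sqrt(var_d / n)


# Hypothetical numbers of questions answered correctly (0 to 6) by the same
# five subjects at two adjacent reading-ease levels.
easier_level = [5, 4, 6, 3, 5]
harder_level = [4, 4, 5, 2, 3]
print(round(paired_t(easier_level, harder_level), 2))
```

The statistic is referred to a t distribution with n - 1 degrees of freedom, where n is the number of subjects measured at both levels.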

Table 1 


Comprehension Scores at R.E. Levels 


Mean SD t 
0.84 
0.98 
0.79 
0.93 
0.84 
0.72 
0.91 
1.13 
* Significant at 5% confidence level. 
** Significant at 1% confidence level. 


The correlation of .69 between the Com- 
prehension Test and the DRT is significant 
at the .001 level using the t test for signifi- 
cance of r (t = 9.48). The Comprehension 
Test, consisting of passages with approximately 
the same adjudged interest value, 
bears sufficient relationship to the standard- 
ized test to warrant the assumption that it 
measures, to a certain extent, the same skills. 
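
The reported t for the correlation can be checked with the usual formula t = r sqrt(N - 2) / sqrt(1 - r^2). The sketch below assumes N = 101, the number of students tested; the function name `t_for_r` is illustrative only.

```python
import math


def t_for_r(r, n):
    """t statistic for testing a product-moment correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# r = .69 between the Comprehension Test and the DRT, N = 101 subjects.
t = t_for_r(0.69, 101)
print(round(t, 2))
```

With the rounded r of .69 this gives a value of about 9.49, in close agreement with the reported 9.48; the small difference presumably reflects rounding of r.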

Although the present investigation does not 
constitute a validation against comprehension 
scores of a general adult population, it en- 
ables us to assume more safely than before 
that the formula would have satisfactory dis- 
criminatory power in terms of the abilities 
and performance of such a group. 

The results of this study, using comprehen- 
sion as a criterion and ten levels of difficulty, 
agree with the investigations of other adult 
reading material involving readership and 
with the study employing two levels of the 
Flesch scale. Thus, the findings lend sup- 
port to the current use of the technique as an 
adequate measure of the relative difficulty of 
“popular” reading material. 


Summary 


This investigation was undertaken to com- 
pare Flesch readability scores with a test of 
reading comprehension using “popular” read- 
ing material. Fifty anecdotes from the “Life 
in These United States” section of the 1951 
Reader’s Digest were condensed and rewrit- 
ten to yield five 100-word passages at each 
of 10 points of difficulty, ranging equidis- 
tantly from 5 to 95. From each RE level, 
one article was selected that had approxi- 
mately the same interest rating as adjudged 
by 10 graduate students on a graphic rating 
scale. Six multiple-choice items with five 
possible alternatives each were written for 
every one of the 10 passages chosen. These 
10 paragraphs and their questions were ad- 
ministered to 101 Denver high school seniors, 
allowing one minute for reading the selection 
and one minute for answering the questions, 
after the General Reading and Comprehen- 
sion subtests of the Survey Section: Upper 
Level of the Diagnostic Reading Tests had 
been given. 

The Comprehension Test developed for this 
study with approximately equal interest value 
per paragraph bears a statistically reliable re- 
lationship to the standardized DRT. In gen- 
eral, differences between mean comprehension 
scores for adjacent RE levels were significant 
at the 5% level. From these findings it is 
inferred the Flesch RE scores do adequately 
estimate the comparative difficulty in com- 
prehension of “popular” reading material for 
a 17- to 18-year-old group. 


Received March 9, 1955. 


Andrews (1) has recently proposed a style 
of typography called the “square span.” The 
material is arranged in double-line blocks as 
follows: 

This is 
an example 


of the 
square span 


style of 
presentation 


This style would utilize both the vertical and 
horizontal visual span, and thus could con- 
ceivably be an aid to reading speed. The 
material is grouped into thought units also, 
and this might be a further aid to compre- 
hension. 

Another style of typography that Andrews 
proposed is the “spaced unit.” 

This is an example of the spaced unit 
style of presentation. 


Here it can be seen that the material is 
grouped into thought units also, although it 
is more like the standard typography than 
the “square span.” 

Problem. It was the purpose of the experi- 
menter to compare the span of visual com- 
prehension for the square span, the spaced 
unit, and the conventional typographical ar- 
rangement by presenting tachistoscopically 
sentences printed in all three styles. 


Method 


Design of the experiment. Three arrangements of 
words, each arrangement being represented by 20 
sentences, were used. The material presented was 
approximately equal in difficulty and length for each 
of the three arrangements. The difficulty of the 
sentences was subjectively determined. The experi- 
mental design is represented in Table 1. The letter 
A represents Arrangement 1, the presentation of sen- 
tences of words arranged in one line horizontally 
across the exposure field. This is illustrated by the 
arrangement of the following sentence: 


The street was not well paved. 


This is the conventional style. The letter B repre- 
sents Arrangement 2, the presentation of sentences of 
words arranged in the same manner as in Arrange- 
ment 1, except that a space separates the sentence 
into two “natural phrases.” This is illustrated by 
the arrangement of the following sentence: 


Do not leave his glove here. 


This is called the spaced-unit style. The letter C 
represents Arrangement 3, the presentation of sen- 
tences of words arranged with the second natural 
phrase appearing below the first. This is illustrated 
by the following sentence: 


The two boys saw 
the car pass. 


This is called the square-span style. 


Table 1 


Order of Presentation of Three Typographical 
Arrangements to Six Groups of Ss in a 
Visual Apprehension Experiment 


         Order of Presentation 
Group    1st    2nd    3rd 
I 
II 
III 
IV 
V 
VI 


As a check on the equality of the difficulty of the 
sets of sentences, a sample was taken from each set 
of 20 sentences. The Flesch formula for difficulty of 
comprehension¹ was applied, giving the following 
indices of readability: the conventional style, 107.9; 
the spaced-unit arrangement, 101.9; and the square- 
span style (which showed the significant advantage 
in the experiment), 102.8. This indicates that all 
sets of sentences were extremely easy in compre- 
hension, the differences among these groups of sen- 
tences in average sentence length and number of 
syllables being negligible. 
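
The Flesch formula cited in the footnote can be written out as a small function. This is a sketch using the rounded constants given in the footnote; the function name and the input counts are hypothetical, since the article does not report the syllable and sentence-length counts for its samples.

```python
def flesch_index(syllables_per_100_words, average_sentence_length):
    """Readability index from the constants quoted in the footnote."""
    return (206.84
            - 0.85 * syllables_per_100_words
            - 1.02 * average_sentence_length)


# Hypothetical counts for a sample of short, simple sentences:
# 110 syllables per 100 words and 6 words per sentence.
print(round(flesch_index(110, 6), 2))
```

Short sentences made of mostly one-syllable words push the index above 100, which is consistent with the values above 100 reported for all three sets.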

1 206.84 − .85 (number of syllables per 100 words) 
− 1.02 (average sentence length). 


Thirty subjects (Ss), college students ranging in 
age from 18 to 39, were used in the experiment. All 
Ss were volunteers. They were divided into six 
groups of five each, each group being presented with 
all three arrangements in a different order as shown 
in Table 1. This enables the presentation of each 
arrangement an equal number of times and also an 
equal number of times to each S. It also provides 
for the presentation of each arrangement an equal 
number of times first, second, and third for the same 
numbers of Ss, each condition being presented first, 
second, and third for 10 Ss. 
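
The counterbalancing described above can be verified with a short enumeration. The sketch assumes that the six groups received the six possible orders of Arrangements A, B, and C, which is what "a different order" for each of six groups implies; which order went to which group is not assumed.

```python
from collections import Counter
from itertools import permutations

ARRANGEMENTS = "ABC"   # conventional, spaced unit, square span
SS_PER_GROUP = 5

# Six groups, one for each possible order of the three arrangements.
orders = list(permutations(ARRANGEMENTS))

# Count how many Ss meet each arrangement in each serial position.
position_counts = Counter()
for order in orders:
    for position, arrangement in enumerate(order, start=1):
        position_counts[(arrangement, position)] += SS_PER_GROUP

for (arrangement, position), n_ss in sorted(position_counts.items()):
    print(arrangement, position, n_ss)
```

Every arrangement occupies every serial position for 10 Ss, so order and practice effects are balanced across the three styles.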

All the sentences were exposed for a period of 100 
msec. in the Dodge mirror tachistoscope. 

Each S received a practice period of five sentence 
presentations before each arrangement. 

A ready signal was given approximately 2 sec. prior 
to each exposure. Immediately after an exposure, 
the S reported what he perceived. This was re- 
corded by the experimenter. 

Each S was given a point for each word correctly 
reported irrespective of whether the order was cor- 
rect. 
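
The scoring rule can be sketched as a word-matching function. The details below are assumptions, not taken from the article: matching ignores case and punctuation, and a repeated word is credited no more often than it occurs in the sentence. The function names are illustrative.

```python
import string
from collections import Counter


def words(text):
    """Lower-case the text, strip ASCII punctuation, and split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def span_score(sentence, report):
    """One point per correctly reported word, regardless of word order."""
    target = Counter(words(sentence))
    reported = Counter(words(report))
    # The multiset intersection credits each word at most as often as it
    # appears in the exposed sentence.
    return sum((target & reported).values())


print(span_score("The street was not well paved.", "street the was paved"))
```

Averaging these per-sentence scores over the 20 sentences of an arrangement gives the comprehension score in words per sentence.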


Results 


The mean comprehension score in words 
per sentence and the standard deviation for 
each condition are as follows: conventional, 
M = 3.032, SD = .829; spaced unit, M = 
2.947, SD = .792; square span, M = 3.722, 
SD = 1.63. 
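
The differences between these means, which are the quantities tested in Table 2, follow directly from the values just given. The sketch below only restates that arithmetic; the ratio at the end is an added illustration and is not a statistic reported in the article.

```python
from itertools import combinations

means = {"conventional": 3.032, "spaced unit": 2.947, "square span": 3.722}

# Difference for every pair of styles, larger mean minus smaller mean.
differences = {}
for a, b in combinations(means, 2):
    high, low = (a, b) if means[a] >= means[b] else (b, a)
    differences[(high, low)] = round(means[high] - means[low], 3)

for pair, diff in differences.items():
    print(pair, diff)

# Relative advantage of the square span over the conventional style.
relative_gain = means["square span"] / means["conventional"] - 1
print(round(relative_gain, 3))
```

The square-span mean exceeds the conventional mean by .690 words and the spaced-unit mean by .775 words, while the conventional and spaced-unit means differ by only .085 words.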

The quantitative comparisons are shown in 
Table 2. These comparisons show a statisti- 
cally significant difference in favor of the 
square-span style of typography over both 
the conventional and the spaced-unit style. 
The difference between the spaced-unit style 
and the conventional style was not statisti- 
cally significant. The overlap scores bear out 
these marked differences. For both the con- 
ventional style and the spaced-unit style only 
20% of the Ss had comprehension scores ex- 
ceeding the mean of the square-span style. 
For the spaced-unit style (the lowest group) 
exactly 50% of the Ss had scores exceeding 
the mean for the conventional style, indicat- 
ing no appreciable difference between these 
two conditions. It can be seen that the inter- 
correlation between comprehension in all of 
these conditions is substantially high, the cor- 
relation coefficients ranging from .63 to .78. 
All the coefficients are statistically significant. 

In comparing the square-span condition 
with the conventional condition, it was found 
that 20 Ss did better under the square-span 
condition, one S did equally well under both 
conditions, and nine Ss did better under the 
conventional condition. In comparing the 
square-span condition with the spaced-unit 
condition it was found that 25 Ss did better 
under the square-span condition and five Ss 
did better under the spaced-unit condition. 
This shows a tendency for more individuals 
to have a larger span for the square-span con- 
dition than for either of the other two. When 
the conventional condition was compared with 
the spaced-unit condition, it was found that 
10 Ss did better under the conventional con- 
dition, three did equally well under both con- 
ditions, and the remaining 17 did better un- 
der the spaced-unit condition. Thus, although 
the differences are significant in favor of the 
square-span style, it can be seen that a sub- 
stantial percentage of the Ss did better on 
each of the other two arrangements than on 
the square-span arrangement. 

In view of this evidence, great caution 
should be exercised in accepting the square- 
span style as being unequivocally superior to 
the other styles investigated in this experi- 
ment. Comprehension under the square-span 
condition seems to involve an adaptation to 
new habits of perceptual grouping and read- 
ing that are contrary to previous habits, judg- 
ing from the remarks of many Ss. Therefore, 
the experimenter feels that there is a distinct 
possibility that the use of well-practiced Ss 
would tend to increase the difference between 
the square-span and other styles. 


Table 2 


Statistical Comparisons and Product-Moment Correlations Between the 
Three Typographical Arrangements 


Conditions Correlated and                      Difference 
Compared                                 N     Between Means    r    t 
Conventional and spaced-unit style       30    .085                  .87 
Square-span and spaced-unit style        30    .775                  - 906** 
Square-span and conventional style       30    .690                  2.89** 

** Significant at the 1% level. 

Note. The mean difference is always in favor of the condition on the left. 

The spaced-unit style showed a slight in- 
feriority to the conventional style, although 
the difference was not statistically significant. 
There are a number of reasons why the 
spaced-unit style may not be of any particu- 
lar advantage. For one thing, the introduc- 
tion of the long space necessitates the spread- 
ing of the print over a longer length of the 
visual field and hence farther into the field 
of indirect vision on either side. This would 
tend to make comprehension less efficient. 
In addition, the space might not separate the 
material into units that would be convenient 
groupings for every reader. At any rate, the 
use of phrase grouping does not seem to fa- 
cilitate comprehension in this case, and the 
advantage of the square-span style would 
seem to be solely a result of an increased 
utilization of the vertical visual span. 

If the trends shown in this experiment were 
to hold under actual reading conditions, it is 
conceivable that printed matter could be read 
up to 25% faster when printed in the square- 
span style than when printed in conventional 
style. However, factors in reading other than 
the span of comprehension should be consid- 
ered before one becomes overly optimistic 
about the possibility of increased reading 
speed. This problem calls for further re- 
search, particularly under actual reading 
conditions. 

The implication for advertising may also be 
worth considering. In selling a product, ad- 
vertisers are well aware of the importance of 
a presentation that “reads fast.” The results 
of this study indicate that a use of the 
vertical as well as the horizontal span might 
contribute to a rapid and easy comprehension 
of the producer’s message. A further test of 
this idea in the advertising situation might 
prove valuable. 


Summary and Conclusions 


1. In a study designed to investigate the 
relationship between the span of comprehen- 
sion and typography, the following three 
styles of typographical arrangement were 
compared: the conventional style, the spaced- 
unit style, and the square-span style. 

2. The square-span style yielded compre- 
hension spans significantly superior to both 
of the other styles investigated. 

3. The other two styles yielded compre- 
hension spans not significantly different from 
each other. 

4. All of the styles showed a high degree 
of interrelationship for all the subjects. 

5. The data indicate that the greater uti- 
lization of the vertical visual span was the 
factor that yielded superior comprehension 
scores for the square-span style. 

6. The square-span style used in books and 
advertising might lead to increased reading 
and comprehension speed. This warrants 
further investigation. 


Fatigue and the Perceptual Field of Work 


USAF School of Aviation Medicine 


The approach of Bartlett and his colleagues 
to the problem of fatigue is largely respon- 
sible for the importance of their findings (1, 
2). Not wishing to extrapolate results de- 
rived from simple and highly repetitious per- 
formance to the more complex levels of work 
behavior, they brought a highly skilled work 
task within the full control of the laboratory 
(3). This was provided by the now familiar 
Cambridge Cockpit which was designed to 
simulate instrument flying and which per- 
mitted the graphic recording of control and 
instrument movements. By determining the 
proficiencies with which the different task 
components were executed and analyzing their 
relative changes as time at the task con- 
tinued, it was possible to observe the progres- 
sive disorganization of the integrated pattern 
of skilled behavior. 

As a consequence, Bartlett (1) has dis- 
cussed certain consistently occurring and in- 
teractive processes which he attributes to fa- 
tigue following highly skilled work and which 
are responsible for the progressive deteriora- 
tion of proficiency. In the main, these are: 
(a) unnoticed increase in the range of toler- 
ated error; (b) gradual loss of timing, i.e., 
the tendency to execute the right response 
but at the wrong time; (c) disintegration of 
the stimulus field; and (d) subsequent dis- 
sociation of the response pattern. 

The effects of such processes, singly or col- 
lectively, upon the performance of a highly 
complex perceptual-motor task leave little to 
speculation. Less lucid, however, is their eti- 
ology. These deteriorative processes might be 
due to prolonged work at the task, as is im- 
plied, or possibly to other factors characteriz- 
ing the particular experimental situation. 

Especially puzzling is the process involving 
the stimulus field. On the basis of observed 
differential decline in the accuracy and ap- 
propriateness of responding to the individual 
components making up the perceptual field of 
work, Bartlett reports (1) that the “splitting 
up” of the integrated stimulus field proceeded 
regularly from the margin to the center. This 
finding could be of significance for practical 
and theoretical considerations of industrial fa- 
tigue since it suggests the possibility of pro- 
gressive constriction of the perceptual field of 
work as a function of prolonged attendance 
to the field. It is well known that progressive 
loss of peripheral vision follows the gradual 
development of hypoxemia and it is not too 
rash to presume that with sustained attend- 
ance in such a task, hypercapnia and its in- 
verse covariant, hypoxemia, may be inevitable 
consequences. 

It is also possible that the observed phe- 
nomena might have resulted from the effects 
of two psychological factors, differential habit 
strength and frustration. With respect to the 
former, it is reported (1) that the instru- 
ments which were located centrally on the 
panel required continuous attention, the in- 
struments marginally located required much 
less and periodic attention, and still others 
required irregular and occasional attendance. 
The latter were the first “to break away from 
the rest.” Undoubtedly, some learning of 
monitoring procedure was necessitated by this 
simulation of flight and so it is reasonable to 
expect greater proficiency in attending the 
central instruments. However, as practice 
continued, attendance of the marginal instru- 
ments should also exhibit improvement. This 
possibly would have occurred had it not been 
for the coexistence of frustration resulting 
from two severe impositions. The Ss were 
pilots with flying experience; and since the 
similarity between an aircraft and the Cam- 
bridge Cockpit was far from complete, it 
seems inevitable that habit interference would 
adversely affect performance in the latter. 
Augmenting this imposed difficulty was the 
exactness required in the experimental situa- 
tion. It was purposely of greater degree than 
that required in actual flight. If, therefore, 
S was prevented from attaining the standard 
of performance which he had established on 
the basis of his flight experience, there is 
little doubt that the simulated flight was a 
frustrating experience. Support of this pos- 
sibility is evidenced in the later investigations 
conducted at the Cambridge Laboratory by 
Davis (4, 5, 6, 7) who describes the general 
disorganization of skill in terms of two suc- 
cessive phases, an “overreaction to” followed 
by a “withdrawal from” the experimental 
situation. The former was reflected by ir- 
regular distribution of attention to various 
aspects of the task, excessive movements of 
the controls leading to overcorrection, in- 
crease in restless movements, and a greater 
emotional disturbance with irritability and ag- 
gression directed to the apparatus. Although 
dissatisfied with their performance, Ss seemed 
determined to improve it. Since it was ex- 
ceedingly difficult to achieve improvement, 
determination to do so dissipated, occasion- 
ing the withdrawal phase. The Ss relaxed, 
accepted certain errors as beyond their re- 
sponsibility, and were either satisfied with or 
undisturbed by a lowered standard of per- 
formance. 

Clearly, then, if the generality of the proc- 
ess concerned is to be properly established, it 
must first be determined whether its occur- 
rence is primarily dependent upon sustained 
and prolonged work or upon the psychologi- 
cal factors which have been discussed. To- 
ward this end, an attempt was made to de- 
termine if attending to a perceptual field of 
work for a prolonged period would induce 
differential decline in the proficiency with 
which the marginally and centrally located 
components were attended and if such de- 
cline could be modified by certain experi- 
mental conditions. 


Method 


Subjects. The 168 Ss were volunteer, basic air- 
men whose activities were confined to a restful regi- 
men for a period of 22 hr. prior to their participa- 
tion in the investigation. 

Apparatus. The work task was provided by the 
USAF SAM Multidimensional Pursuit Test (9) 
which requires continuous attendance to an instru- 
ment panel containing four simulated aircraft instru- 
ments. Two of these whose separation determines a 
visual angle of approximately 17° are vertically po- 
sitioned in the center of the panel. Equidistantly 
positioned on either side of the lowermost of the 
centrally located instruments are the other two in- 
struments. The separation of these latter margin- 
ally located instruments determines a visual angle of 
approximately 28°. From observation and interroga- 
tion of Ss, it was found that they developed the 
habit of fixating either the upper or lower of the 
two center instruments and would either depend 
upon peripheral vision for the detection of move- 
ment of the other instruments or would quickly 
“check” these latter instruments and return their at- 
tention to the central instruments. Behaviorally, 
then, the situation appeared to be one involving 
centrally and peripherally located instruments. 

During a work trial, these instruments are auto- 
matically made to drift randomly from their null 
positions. The S’s task is to maintain concurrently 
all instruments within their respective ranges of 
tolerated error¹ by the timely and appropriate ma- 
nipulation of simulated aircraft controls. Proficiency 
is rated by the total time per trial that all instru- 
ments are so maintained. In order to determine the 
proficiency with which the individual instruments 
were controlled and thereby permit an analysis of 
the relative rates of proficiency decline for the cen- 
trally and marginally located instruments, the appa- 
ratus was modified so as to record electrically the 
total time per trial that each instrument had been 
maintained at null. A cycling system meters out 
alternate 1-min. work trials and 15-sec. rest periods 
and, in addition, regulates for each instrument the 
frequency of its departure from null. In a cycle of 
eight 1-min. work trials, these frequencies are equal 
for all instruments. 
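
The two proficiency measures, time at null for each instrument and time at null for all four concurrently, can be illustrated as follows. The apparatus recorded these times electrically; the sketch instead assumes a hypothetical sequence of equally spaced observations, each stating whether an instrument was within its range of tolerated error. The function name and the data are illustrative.

```python
def time_at_null(samples, seconds_per_sample=1.0):
    """Return (time at null per instrument, time all were at null together).

    Each sample maps an instrument name to True when that instrument is
    within its range of tolerated error at that moment.
    """
    if not samples:
        return {}, 0.0
    instruments = list(samples[0])
    per_instrument = {name: 0.0 for name in instruments}
    concurrent = 0.0
    for sample in samples:
        for name in instruments:
            if sample[name]:
                per_instrument[name] += seconds_per_sample
        if all(sample[name] for name in instruments):
            concurrent += seconds_per_sample
    return per_instrument, concurrent


# Four hypothetical one-second observations of the four instruments.
samples = [
    {"Bank": True, "Turn": True, "Air Speed": True, "RPM": True},
    {"Bank": True, "Turn": False, "Air Speed": True, "RPM": True},
    {"Bank": False, "Turn": True, "Air Speed": True, "RPM": False},
    {"Bank": True, "Turn": True, "Air Speed": False, "RPM": True},
]
individual, together = time_at_null(samples)
print(individual, together)
```

Because the concurrent score requires all four instruments to be at null at once, it can never exceed the lowest individual score and is usually well below it.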

The task provided by this apparatus, as contrasted 
to that of the Cambridge studies, differs in certain 
important respects. Since events occur at each of 
the instruments at an equal rate, the development 
of differential habit strength is not likely. More- 
over, induced frustration should not be severe. Un- 
realistic proficiency is not required of Ss and since 
they are experimentally naive and without flying ex- 
perience, interference with the acquisition of pro- 
ficiency should be negligible. It is proposed, there- 
fore, that if the rates of decline in the proficiency of 
individual instrument control are generally found to 
be different, then the differences may be accounted 
for by the single factor of prolonged work. 

1 During the time that the instrument indicator is 
within its narrow band or range of tolerated error, it 
is said to be at null or null position. 


Experimental conditions. There were three addi- 
tional variables and their respective conditions are 
as follows: 

1. Pharmacological treatments, orally administered 
in capsule form: (a) Dexedrine, single dose (5 mg. 
d-amphetamine sulfate); (b) Dexedrine, repetitive 
dose (timed disintegration with 4-hr. peak release of 
approximately 6144 mg. d-amphetamine sulfate per 
disintegration); (c) caffeine derivative; (d) Bena- 
dryl-Hyoscine-Dexedrine (50 mg. diphenhydramine 
hydrochloride, .65 mg. scopolamine, 5 mg. d-amphe- 
tamine sulfate); (e) Benadryl-Hyoscine (50 mg. 
diphenhydramine hydrochloride, .65 mg. scopola- 
mine); (f) placebo (lactose); and (g) a control 
group receiving no drug. 

2. Different systems of information as to when all 
four instruments were being concurrently maintained 
at null. This information was to be primarily given 
by either (a) the customary methods of monitoring 
the perceptual field; or (b) a visual signal that was 
given by an unobtrusively positioned lamp which 
illuminated the cockpit floor; or (c) an auditory 
signal consisting of a 400-cycle tone transmitted 
through a lightweight monaural receiver. 

3. Different goal proximities as determined by 
knowledge of the length of work task. Before the 
start of the work task, half of the Ss were in- 
formed, “You will be required to perform this task 
for a period of four hours after which you will be 
given a ‘break.’” The other half of the Ss were in- 
formed, “You will be required to perform this task 
for a period of seven hours.” Actually, both groups 
were given a 15-min. rest at the end of 4 hr. of work. 

Since these experimental conditions were expected, 
and have since been shown (8), to effect different 
levels of proficiency throughout the task of control- 
ling all instruments concurrently, their effects upon 
the proficiency of individual instrument control 
might be expected to contribute to a better under- 
standing of the problem under present consideration. 
For example, the decrement normally occurring in 
the concurrent control of instruments was found to 
be profoundly allayed and augmented by the ana- 
leptic and depressant drugs, respectively. If con- 
striction of the perceptual field can be inferred to 
occur as a function of prolonged work, to what ex- 
tent is this process subject to the level of cerebral 
efficiency as determined by pharmacological means? 
In the case of the different information systems, it 
was found that the use of the supplementary signals 
resulted in greater proficiency of concurrent instru- 
ment control throughout the entire period of work. 
This was believed to be due to the greater efficiency of 
monitoring afforded by these supplementary signals, 
i.e., S could rely upon these cues to indicate the de- 
parture of the marginally located instruments from 
null position instead of having to obtain this infor- 
mation through the use of peripheral or direct vision. 
If this is true, then these supplementary sources of 
information should act to prevent any possible con- 
striction of the perceptual field. Lastly, the condi- 
tion involving the more proximate goal resulted in a 
greater proficiency of concurrent instrument control 
which was evidenced from the first to the last cycle 
of work. If this is attributable to a greater expendi- 
ture of effort, what would be the effects of such ex- 
penditure upon the integrity of the perceptual field 
of work? 

Procedure. All Ss received, under common condi- 
tions, 40 trials (50 min.) of practice at the task. 
Immediately following this, the above-mentioned ex- 
perimental conditions were introduced and Ss began 
the initial period of work (9:30 A.M.-1:30 P.M.). 
At the completion of this 4-hr. period, all Ss were 
given a 15-min. rest during which they left the test- 
ing rooms, relieved themselves, and ate a light lunch. 
They then returned to the apparatus formerly oc- 
cupied and began the final period of work (1:45 
P.M.-4:45 P.M.). 

The Ss, who were assigned in a predetermined 
random order to the 42 cells (n = 4 per cell) yielded 
by the factorial design, were tested four at a time 
under conditions that prevented observation or 
knowledge of one another’s performance, and were 
treated alike for any one testing day. Illumination, 
temperature, and humidity were maintained at con- 
ditions considered optimal. 
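
The 42 cells of the factorial design follow from crossing the seven drug conditions, three information systems, and two goal proximities described under Experimental conditions. The sketch below enumerates them; the short labels are paraphrases chosen for illustration.

```python
from itertools import product

drugs = [
    "Dexedrine, single dose",
    "Dexedrine, repetitive dose",
    "caffeine derivative",
    "Benadryl-Hyoscine-Dexedrine",
    "Benadryl-Hyoscine",
    "placebo",
    "no drug",
]
information_systems = ["customary monitoring", "visual signal", "auditory signal"]
goal_proximities = ["told 4 hr.", "told 7 hr."]
SS_PER_CELL = 4

# One cell for every combination of the three experimental variables.
cells = list(product(drugs, information_systems, goal_proximities))
print(len(cells), len(cells) * SS_PER_CELL)
```

Seven by three by two gives 42 cells, and four Ss per cell accounts for the 168 Ss reported under Subjects.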


Results and Discussion 


The general outcome is clearly revealed by 
Fig. 1. Here, the mean time per cycle (8 
work trials) that each instrument was main- 
tained at null position as well as the mean 
time per cycle that all were so maintained 
concurrently is plotted for the practice and 
7-hr. work periods. During the former, rates 
of learning are approximately equal despite 
the apparent differential difficulty in the con- 
trol of these instruments. Comparing the 
curves of these four instruments with the 
lowermost curve discloses an estimate of the 
difficulty of the task assigned to S. The in- 
struments are individually maintained within 
the limits of tolerated error for a little more 
than one-half of the total working time. Con- 
currently, they are maintained within these 
limits for only about one-fourth of the total 
working time. 

During the 7-hr. work period, it is evident 
that decline in proficiency progresses at about 
the same rate for all individual instruments. 
At no time do the marginally located instru- 
ments, Air Speed and RPM, exhibit greater 
deterioration of proficiency than do the cen- 
tral instruments, Bank and Turn. The paral- 
lelism existing between the individual in- 
strument curves throughout the entire work 
period is seen to exist also between these 
curves and the concurrent control curve. 

Having noted no general tendency for the 
rates of decrement in individual instrument 
control to become differentiated as a function 
of prolonged work, it is now necessary to es- 
tablish statistical significance and also to de- 
termine if any of the experimental conditions 
might have occasioned results to the contrary. 
Accordingly, an analysis was performed in ac- 
cordance with a split-split-plot design where 
Drugs, Information Systems, and Goal Prox- 
imity constitute the independent experimental 
variables, and the four Instruments and se- 
lected Cycles are treated as replications. By 
such an analysis it is possible to test for the 
significance of the three-way interaction of 
each of the independent experimental vari- 
ables × Instruments × Cycles. Depending 
upon the level of significance, one can con- 
clude whether or not nondifferentiation in de- 
crement rates, such as exhibited by Fig. 1, 
is characteristic of all conditions making up 
each experimental variable. The cycles se- 
lected for analysis were 10, 13, 16, 19, 28, 31, 
34, and 37. These are representative of pro- 
ficiency during what may be considered to be 
the critical portion of the prolonged task. In 
addition, these are not likely to reflect to any 
serious degree certain effects which might be 
attributable to the anticipation of rest, the 
rest period, or the anticipation of final termi- 
nation of work. For each of these selected 
cycles, the individual instrument proficiency 
values were adjusted for regression upon their 
corresponding pre-experimental values. The 
variances of these adjusted values were then 
analyzed in the manner that has been de- 
scribed. 
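
Where the selected cycles fall within the two work periods can be worked out from the timing given under Apparatus and Procedure. The sketch rests on two assumptions not stated in the article: that every 1-min. trial is followed by a 15-sec. rest, and that cycles are numbered consecutively from the start of the initial work period, with the 15-min. rest not counted.

```python
TRIAL_SEC = 60
REST_SEC = 15
TRIALS_PER_CYCLE = 8

# Eight trials, each with its rest, make one cycle of 10 minutes.
CYCLE_MIN = TRIALS_PER_CYCLE * (TRIAL_SEC + REST_SEC) / 60


def cycles_in(hours):
    """Number of complete cycles that fit into a work period."""
    return int(hours * 60 // CYCLE_MIN)


INITIAL_CYCLES = cycles_in(4)   # 9:30 A.M. to 1:30 P.M.
FINAL_CYCLES = cycles_in(3)     # 1:45 P.M. to 4:45 P.M.


def locate(cycle):
    """Return the work period and the position of a cycle within it."""
    if cycle <= INITIAL_CYCLES:
        return ("initial", cycle)
    return ("final", cycle - INITIAL_CYCLES)


selected = [10, 13, 16, 19, 28, 31, 34, 37]
for cycle in selected:
    print(cycle, locate(cycle))
```

Under these assumptions the initial period holds 24 cycles and the final period 18, so the selected cycles lie in the interior of each period, away from the beginning of work, the rest pause, and the end of the task.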

Table 1 presents these results. Of these, 
it is sufficient to consider only those interac- 
tions relevant to the questions which have 
been raised. Consider first the Cycles × In- 
struments interaction. The fact that it does 
not approach a critical level of significance 
confirms the validity of the general finding 
presented in Fig. 1. Second, of the three- 
way interactions explained in the preceding 
paragraph, only one, the interaction involving 
the Drug variable (CID), can be considered 
to be significant (P = .01). Further exami- 
nation of the data revealed that the signifi- 
cance of this interaction is not due to differ- 
ent rates of proficiency decline in the control 
of the marginally and centrally located in- 
struments but to the fact that for one of the 
cycles, differences in proficiency of individual 
instrument control were not uniform from one 
drug group to another. During Cycle 37 an 
abrupt and substantial proficiency loss in the 
control of the Turn instrument was evidenced 
by the two groups which had received Dexedrine.
Third, the fact that the interactions,
Cycles × Instruments × Feedback and
Cycles × Instruments × Goal, are not significant
may be taken to indicate that differential
rates of proficiency decline in the control
of the individual instruments did not occur 
within any of the conditions determined by 
these two independent variables. 
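
The variance ratios of Table 1 can be checked with a short calculation. The sketch below assumes that the between-subjects effects (Drugs, Feedback, Goals) are tested against the "Subjects treated alike" mean square; that pairing is an inference from the printed values, and the function and variable names are hypothetical.

```python
def f_ratio(ms_effect, ms_error):
    """Variance ratio: effect mean square divided by its error mean square."""
    return ms_effect / ms_error

# Mean squares as printed in Table 1.
MS_SUBJECTS = 86_461.5  # "Subjects treated alike"; assumed error term
between_subject_ms = {
    "Drugs": 3_104_749.7,
    "Feedback": 947_110.7,
    "Goals": 469_990.8,
}

f_values = {
    name: round(f_ratio(ms, MS_SUBJECTS), 2)
    for name, ms in between_subject_ms.items()
}
# f_values -> {'Drugs': 35.91, 'Feedback': 10.95, 'Goals': 5.44}
```

Under this assumption the computed ratios agree with the F values printed for the three main effects.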

Table 1

Analysis of Adjusted Variances

Source                     Mean Square         F
Drugs                      3,104,749.7     35.91
Feedback                     947,110.7     10.95
Goals                        469,990.8      5.44
DF                            74,408.9
DG                           113,204.4
FG                           118,758.0
DFG                          128,906.6
Subjects treated alike        86,461.5

Instruments                2,964,334.2    264.90
ID                            10,247.4       .92
IF                            33,735.5      3.01
IG                             7,956.9       .71
IDF                           12,818.6
IDG                           12,396.5
IFG                            6,379.0
IDFG                           9,708.4
S × I                         11,189.3

Cycles                     1,506,992.6    525.80
CD                            55,112.7     19.23
CF                             8,000.5      2.79
CG                            16,915.1      5.90
CDF                            6,450.9      2.25
CDG                           15,302.5      5.34
CFG                            8,304.5      2.90
CDFG                          10,203.0
CI                             3,655.7
CID                            5,301.5
CIF                            2,685.7
CIG                            3,509.6
CIDF                           2,295.7
CIDG                           2,268.1
CIFG                           1,294.7
CIDFG                          2,985.9
Error                          2,866.1

While these results appear to demonstrate
that prolonged attention to a complex perceptual
field of work does not occasion differential
rates of proficiency decline such as were
noted by Bartlett, the generality of this finding
remains subject to certain restrictive considerations.
In addition to the previously
discussed differences between the work task 
of the present study and that provided by the 
Cambridge Cockpit, other differences exist. 
First, the latter involved a higher order of 
integration by virtue of the linkage between 
some of the perceptual-motor components 
making up the task and it may well be that 
such integration is a necessary requirement 
for the occurrence of the phenomena in ques- 
tion. Yet, this possibility runs counter to a 
seemingly reasonable assumption; to wit, the 
more integrated the perceptual field of work, 
the more resistant should that field be to dis- 
integration provided, of course, that the in- 
dividual is skilled in monitoring the field. 
Second, the Cambridge Cockpit required at- 
tendance to a greater number of instruments, 
hence, the greater the diversity and area of 
the perceptual field of work. It could be 
argued that differential rates of proficiency 
decline might not be expected to occur in the 
present study since the task involved a sim- 
pler and smaller field of visual displays. To 
compensate for this acknowledged discrep- 
ancy, Ss of the present study were required 
to work for quite a long period of time. De- 
spite the duration of work, differential rates 
of proficiency decline did not occur. Per- 
haps even more crucial is the fact that some 
of these Ss were required to work while un- 
der the influence of the cerebral depressant, 
Benadryl-Hyoscine. Figure 2 presents for 
these Ss the mean time per cycle that each 
instrument was maintained at null as well as 
the mean time that all were concurrently so 
maintained. Considerably greater decline in 
proficiency is evidenced by these curves as 
compared with those of Fig. 1. Yet, during 
the latter portion of the work period where 
fatigue and pharmacologically induced cere- 
bral depression have rendered S largely in- 
capable of maintaining all instruments within 
their respective ranges of tolerated error con- 
currently, his attendance to the marginally 
located instruments does not seem to suffer 
any greater deterioration than does his at- 
tendance to the central instruments. 

Fig. 2. Proficiency of individual and concurrent instrument control for Ss receiving Benadryl-Hyoscine (N = 24).

On the basis of the evidence presented, it
can be concluded that prolonged and exacting
attendance to the movements of individual
instruments does not lead to differential de- 
cline in the accuracy and appropriateness of 
responding to these components of the per- 
ceptual field of work. It seems likely that 
the finding reported by Bartlett may be at- 
tributed to the psychological factors of dif- 
ferential habit strength and frustration in- 
duced by task characteristics. 

Finally, there remains to be considered the 
question concerning the impairment of pe- 
ripheral vision which might reasonably be ex- 
pected to have resulted from fatigue-induced 
changes in cellular metabolism and the ef- 
fects of such impairment upon the percep- 
tual field of work. Since constriction of the 
perceptual field cannot be inferred from the 
results of the investigation, certain arguments
are indicated. On the one hand, it is
quite likely that prolonged work of this na- 
ture does not occasion impairment of pe- 
ripheral vision, at least to the extent of af- 
fecting the way in which a perceptual field is 
monitored. On the other hand, and equally 
apparent, is the possibility that impairment 
of peripheral vision does occur but with results
contrary to what might be expected. In
support of this latter possibility, the follow- 
ing is submitted: According to our observa- 
tion the method by which an inexperienced 
operator monitors a perceptual field of work 
consists of scanning behavior which brings 
into direct and sequential view each of the 
components making up the field. As experi- 
ence at the task continues, scanning behavior 
undergoes change in accordance with the 
learned employment of peripheral vision. 
The result is greater efficiency with extent 
being dependent upon the degree to which 
this supplementary source of information is 
employed. If attendance is sustained for a 
sufficient period of time, impairment of pe- 
ripheral vision most likely will result. If so, 
the impairment will be perceived by the op- 
erator and compensated for by reversion to 
the earlier methods of scanning. As a conse- 
quence, there will occur a progressive decline 
in the proficiency with which all perceptual 
components are attended but not, necessarily, 
a progressive constriction of this field of 
work. 





G. T. Hauty and R. B. Payne 


Summary 


Proficiencies in the control of several simu- 
lated aircraft instruments were appraised 
throughout 7 hr. of work to determine if the 
control of marginally located instruments 
suffered greater progressive impairment than 
did the control of those instruments located 
centrally on the instrument panel. Progres- 
sive decrement in proficiency occurred for all 
instruments, but the rates of decline were not 
found to be significantly different. It is concluded
that in a similar work situation, dissociative
changes in a field of visual displays
are not likely to occur as a function of sustained
and prolonged attendance to this field
of work. 


A Comparison of Pursuit and Compensatory Tracking in a
Simulated Aircraft Control Loop 


Rube Chernikoff, Henry P. Birmingham, and Franklin V. Taylor 

Two tracking display systems are commonly
distinguished: (a) pursuit, and (b)
compensatory. In pursuit tracking, S’s task 
is to position a control marker to coincide 
with a moving course marker. In compen- 
satory tracking, S attempts to keep a moving 
marker aligned with a stationary reference 
point. This moving marker, in the compen- 
satory situation, is continuously positioned 
by reference to the difference between the 
course and control outputs. 
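
The distinction between the two displays can be put in a few lines. The sketch below is a hypothetical illustration only: the function names, the dictionary keys, and the sign convention for the error (course minus control) are assumptions, not details given in the text.

```python
def pursuit_display(course, control):
    """Pursuit: the course marker and the control marker are shown separately."""
    return {"course_marker": course, "control_marker": control}

def compensatory_display(course, control):
    """Compensatory: a fixed reference plus one marker driven by the difference."""
    return {"reference": 0.0, "error_marker": course - control}

# With the control lagging the course by 1.5 units:
pursuit = pursuit_display(course=2.0, control=0.5)
compensatory = compensatory_display(course=2.0, control=0.5)
```

In the pursuit case S sees both quantities and can judge their separate motions; in the compensatory case only their difference is available.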

Tracking control systems may take a va- 
riety of forms. Performance on compensatory 
and performance on pursuit displays have
been compared using a “position” or “un- 
aided” control system, and also with “aided” 
control. With a position control system, the 
output position of the marker displayed on 
the cathode-ray tube is proportional to the 
input position of S’s control. With aided 
tracking, a movement of the control not only 
causes a proportional change in the position 
of the control marker, but also introduces a 
change in marker velocity, and in some sys- 
tems, marker acceleration as well. 
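
The difference between position and aided control may be sketched numerically. The following is a minimal discrete-time illustration, assuming simple rectangular integration; the gain names (k_pos, k_rate, k_acc) and their values are hypothetical and do not describe any particular apparatus.

```python
def position_control(stick, gain=1.0):
    """Unaided control: marker position is proportional to stick position."""
    return [gain * u for u in stick]

def aided_control(stick, dt, k_pos=1.0, k_rate=1.0, k_acc=0.0):
    """Aided control: stick displacement also sets marker velocity (k_rate)
    and, in some systems, marker acceleration (k_acc)."""
    integral = 0.0          # running integral of stick position
    double_integral = 0.0   # running integral of that integral
    marker = []
    for u in stick:
        integral += u * dt
        double_integral += integral * dt
        marker.append(k_pos * u + k_rate * integral + k_acc * double_integral)
    return marker

stick = [1.0] * 10                     # stick held at a constant displacement
unaided = position_control(stick)      # marker stays at one position
aided = aided_control(stick, dt=0.1)   # marker keeps moving while stick is held
```

With the stick held off center, the unaided marker remains displaced by a fixed amount, whereas the aided marker continues to travel.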

Studies comparing the pursuit and com- 
pensatory tracking modes with position con- 
trol have shown pursuit to be the more ac- 
curate of the two (2, 3, 4), while a study em- 
ploying aiding (2) has indicated no difference 
between the two tracking displays. 

The advantage accruing to the pursuit dis- 
play has been attributed to the additional in- 
formation available (1). With pursuit, S 
can make use of position, rate, and accelera- 
tion changes in the course, in his control 
marker, and in the error. With compen- 
satory, only position, rate, and acceleration 
changes in the error can be used by S since 
the changes in the course are confounded with 
those produced by S’s control movements.

It might be argued that although the better 
information provided by the pursuit display 
could be used to advantage in a position control
arrangement in which there is no lag in
the system, the advantage would disappear if 
a delay were inserted into the control loop. 
Such a delay occurs in the case of an air- 
craft, wherein a motion of the stick does not 
produce an instantaneous change in the ob- 
served display. Rather, the response of the 
aircraft bears to the motion of the control 
stick a rather complex relationship which is 
perceived by the pilot as a time lag or a 
“looseness” of control. 

The purpose of the present experiments is 
to compare the relative effectiveness of a 
pursuit and a compensatory display in a 
simulated one-coordinate aircraft control loop. 
Further, since the difficulty of the tracking 
control task might be expected to influence 
the results, course difficulty will also be 
treated as a parameter. Two studies will be 
reported employing different ranges of target 
course frequencies. 


Method 
Apparatus 


Display. A 5-in. cathode-ray tube (CRT), with 
a P-11 short persistence screen, was used to display 
the tracking information. By means of an elec- 
tronic switch, the CRT beam was time-shared to 
produce two spots of light. One spot was shaped 
into a }-in. vertical line, while the other was left as 
a tw-in. diameter dot. During pursuit tracking, the 
vertical line was controlled by S and the dot po- 
sitioned by a course generator. When compensatory 
tracking was used, the dot was fixed in the center 
of the CRT face as the reference, while the vertical 
line was driven by the difference between the course 
generator output and S’s control movements. 

Control. The S positioned the vertical line marker 
by left-right movements of a spring-centered joy 
stick. This stick was 13 in. long, with a spring 
sensitivity such that a force of 0.8 oz. was required 
for a 1-degree stick displacement. Since the maxi- 
mum stick displacement required was about 40 de- 
grees each side of center, the maximum force needed 
was approximately 2 lbs. 

Table 1

Sine Wave Frequencies Comprising Each of the Courses

Course    Frequencies (cycles per minute)
A         1 1/9      2 2/9      3 1/3
B         3 1/3      6 2/3      10
C         6 2/3      13 1/3     20
D         10         20         30

Control loop. To simulate the “looseness” of
control found in an aircraft control loop, two integrators,
with appropriate gain constants, were inserted
between the display output and the control
input. A feedback loop was provided around one 
of the integrators, with 15% of the output fed back 
to the input to simulate more closely the aircraft 
response. 
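
The behavior of such a loop can be approximated in a few lines. The sketch below is not the authors' circuit: it assumes unit gains, negative feedback of 15% of the output around the first of the two integrators, and simple Euler integration, all of which are illustrative choices.

```python
def simulate_loop(stick, dt, gain1=1.0, gain2=1.0, feedback=0.15):
    """Two integrators in series, with feedback around the first one.

    Returns the marker position (output of the second integrator) per sample.
    """
    x1 = 0.0  # output of the first integrator (the one with feedback)
    x2 = 0.0  # output of the second integrator: displayed marker position
    marker = []
    for u in stick:
        x1 += gain1 * (u - feedback * x1) * dt
        x2 += gain2 * x1 * dt
        marker.append(x2)
    return marker

# Response to the stick being moved abruptly to 1.0 and held there for 5 sec.
response = simulate_loop([1.0] * 500, dt=0.01)
```

The marker does not jump when the stick is moved; it begins almost at rest and gathers speed, which is the time lag or "looseness" that S must contend with.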

Courses. Four courses, each consisting of a com- 
plex of three sine waves, were used in the studies. 
The frequencies comprising each of the courses are 
listed in Table 1. For all courses, the relative ampli- 
tude of each sine wave was inversely proportional 
to its frequency. This provided an equal maximum 
rate for each frequency of the course. The maxi- 
mum course excursion was across the center 4 in. of 
the CRT face. 
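
That amplitudes inversely proportional to frequency give equal maximum rates follows from the derivative of a sine wave, whose peak value is 2πfa. A brief check, using illustrative frequencies and a hypothetical proportionality constant k:

```python
import math

def max_rate(amplitude, freq_cpm):
    """Peak rate of amplitude * sin(2*pi*f*t), in display units per minute."""
    return 2 * math.pi * freq_cpm * amplitude

freqs_cpm = [10.0, 20.0, 30.0]   # illustrative frequencies, cycles per minute
k = 1.0                           # hypothetical proportionality constant
amplitudes = [k / f for f in freqs_cpm]
rates = [max_rate(a, f) for a, f in zip(amplitudes, freqs_cpm)]
```

Every component has the same peak rate, 2πk, so no single frequency dominates the velocity demanded of S.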

Scoring. Since an AC voltage, which is propor- 
tional in amplitude to the separation between the 
spots on the CRT face, exists at the output of the 
electronic switch, it was possible to obtain directly 
a measure of spot separation. By use of an electronic
integrator, this voltage was integrated over
the scoring duration by cumulating a charge on a 
condenser. Integrated error was then read directly 
on a voltmeter. 
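
A digital analogue of this scoring is sketched below. It assumes that the score is the time integral of the absolute spot separation, approximated by a rectangular sum over sampled positions; the function name and the default scoring window (the last 45 sec. of a 1-min. trial, as described under Procedure) are stated as parameters.

```python
def integrated_error(course, control, dt, trial_s=60.0, scored_s=45.0):
    """Integrate |course - control| over the last `scored_s` seconds of a trial.

    `course` and `control` are equally spaced samples, `dt` seconds apart.
    """
    start = int(round((trial_s - scored_s) / dt))
    separation = (abs(c - m) for c, m in zip(course[start:], control[start:]))
    return sum(separation) * dt

# Hypothetical trial: a constant separation of 2 units, sampled every 0.5 sec.
course = [2.0] * 120
control = [0.0] * 120
score = integrated_error(course, control, dt=0.5)
```

Because the separation is taken without regard to sign, errors to the left and right of the target accumulate rather than cancel.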

Procedure

The S sat facing the CRT display at a viewing 
distance of approximately 12 in. A fluorescent desk 
lamp was placed behind the CRT and facing the 
wall to provide room illumination with a minimum 
of reflection from the CRT face. The joy stick was 
mounted 3 in. to the right of the CRT centerline, 
with the handle at the seated S’s elbow height. 

Before the experimental sessions were begun, two 
training sessions, with seven practice trials per con- 
dition, were given to familiarize S with the equip- 
ment. For pursuit tracking, S was instructed to 
move the joy stick to the left or right so as to keep 
the vertical line marker on the moving course dot, 
while for compensatory he was instructed to keep 
the vertical line marker on the fixed reference dot. 

In Experiment I, Courses A and B were run with 
both pursuit and compensatory tracking. Five Navy 
enlisted men served as Ss. The four conditions: 
compensatory, Course A; pursuit, Course A; com- 
pensatory, Course B; and pursuit, Course B, were 
presented in randomized order in blocks of six trials 
per condition for 12 daily sessions. One practice 
trial always preceded each block of scored trials. A 
trial was 1 min. in duration, with the last 45 sec. 
scored by the error integrator. The S was given a 
30-sec. rest interval between trials, while the interval 
between successive conditions was approximately two 
hrs. Before each trial, E positioned the course dot 
near the center of the CRT face and randomly 
varied the direction in which the course would start. 

A new group of five Navy enlisted men was used
in Experiment II. Six conditions were presented:
compensatory, Course B; pursuit, Course B; com- 
pensatory, Course C; pursuit, Course C; compen- 
satory, Course D; and pursuit, Course D. The same 
equipment was used, and the same procedure fol- 
lowed, as described for the first study, but with one 
modification. Since it was impossible to run all five 
Ss daily on all six conditions, three Ss were run on 
all conditions for 12 daily sessions, followed with 
similar runs for the remaining two Ss. 


Results 


Integrated error scores were used as the
measure of tracking effectiveness with each
of the two tracking systems. This method
of scoring provides a number expressing the
total cumulative error integrated over time.
The measuring units, while arbitrary, are on
a linear scale. Figure 1 presents the results
of Experiment I, showing integrated error
scores for the four experimental conditions
for each of the 12 daily sessions. Since the
main interest in the study was comparative
tracking performance between the pursuit
and compensatory displays after a fairly
stable level had been reached, the statistical
analysis was performed on the data taken
during the last five days of the experiment
(Sessions 8-12).

Fig. 2. Tracking error scores for the three courses of Experiment II. Course A is plotted from the data of Experiment I. Each point represents the mean of six trials for each of five Ss.

Table 2

Experiment I: Comparison Between Tracking Systems
for Course A and Course B

                Conditions
Course    Pursuit    Compensatory    Difference
A         18.8       20.1            1.3
B         46.6       56.2            9.6

Table 2 shows comparative performance 
and t values for the compensatory and pursuit
conditions on Course A and Course B.
The data compiled in this table are the mean 
integrated error scores for the five Ss for the 
last five sessions. There was no significant
difference in tracking performance between
the pursuit and compensatory systems on 
Course A. However, with Course B the pur- 
suit mode yielded significantly less error than 
did compensatory. 

The data from Experiment II are given in 
Fig. 2. Integrated error scores for the six 
experimental conditions for each of the 12 
sessions are presented. Also shown in this 
figure, for comparison purposes, are the curves 
for Course A, plotted from the data obtained 
in Experiment I. The results of the t tests
comparing pursuit and compensatory tracking
on Course B, Course C, and Course D,
are summarized in Table 3. The t tests are
based on the data obtained during the last
five sessions of the study. For all three
courses, tracking error with the pursuit mode
was significantly less than that obtained with
compensatory.

Fig. 3. Log plot of tracking error for the three courses of Experiment II and Course A of Experiment I. Each point represents the mean of six trials for each of five Ss.

Table 3

Experiment II: Comparison Between Tracking Systems
for Course B, Course C, and Course D

                Conditions
Course    Pursuit    Compensatory    Difference
B         33.7       54.1            20.4
C         85.5       128.2           42.7
D         169.4      243.2           73.8

Inspection of Fig. 2 indicates that in Ex- 
periment II as the level of course difficulty 
increased, so did the absolute difference in 
performance between compensatory and pur- 
suit tracking. However, the relative differ- 
ence between the two tracking modes was 
found to be about the same. This is shown 
in Fig. 3, in which the logarithm of the track- 
ing error scores is plotted. The mean pro- 
portional improvement of pursuit tracking 
over compensatory, for scores taken during 
the last five days, was as follows: Course B, 
37.7%; Course C, 33.3%; and Course D, 
30.3%. 
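
The proportional improvements can be recomputed from the mean scores of Table 3. The sketch assumes that the percentage is the pursuit-compensatory difference divided by the compensatory score; the function name is hypothetical.

```python
def percent_improvement(pursuit_error, compensatory_error):
    """Reduction in error with pursuit, as a percentage of compensatory error."""
    return 100.0 * (compensatory_error - pursuit_error) / compensatory_error

# Mean integrated error scores (pursuit, compensatory) from Table 3.
table_3 = {"B": (33.7, 54.1), "C": (85.5, 128.2), "D": (169.4, 243.2)}

improvement = {
    course: round(percent_improvement(p, c), 1)
    for course, (p, c) in table_3.items()
}
# improvement -> {'B': 37.7, 'C': 33.3, 'D': 30.3}
```

Under this definition the computed values agree with the percentages given in the text, and the same formula applied to the Course B means of Experiment I (46.6 and 56.2) gives 17.1%.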

In Experiment I, Course B was the more 
difficult of the two courses used, while in Ex- 
periment II, Course B was the easiest of the 
three presented. Performance on compen- 
satory tracking with Course B did not differ 
significantly between Experiments I and II 
(t = 0.66, p = .50). However, with pursuit, 
error scores for Experiment II were signifi- 
cantly lower than those for Experiment I 
(t = 3.45, p < .01).

In a previous study (1) in which pursuit 
and compensatory displays were compared on 
a course similar to Course B of the present 
experiments, but using position control, pur- 
suit was found to be 39.7% more accurate 
than compensatory. In the present study, 
Course B showed a 17.1% improvement of
pursuit over compensatory in Experiment I,
while in Experiment II, this course resulted 
in a 37.7% improvement favoring pursuit. 
The increase in accuracy of pursuit over compensatory
with Courses C and D was approximately
that found with Course B in Experiment II.
The above evidence indicates
that the improvement of pursuit over com- 
pensatory in the present experiments, while 
not as great, approaches that found in the 
comparison between pursuit and compensa- 
tory displays using position control. 


Discussion 


The results of the present experiments in- 
dicate that, except for the easiest course, 
pursuit tracking is superior to compensatory 
in a simulated one-coordinate airplane-type 
control task. The fact that no difference be- 
tween the two tracking modes was found with 
the course containing the lowest frequencies, 
whereas the other courses showed a percent- 
age difference of equal amount, calls for fur- 
ther research with target-course frequencies 
between the lowermost two employed in the 
present study. Such a study should also help 
resolve the significant difference found with 
pursuit, Course B, between the first and 
second experiments. 

The predominant superiority of the pursuit 
mode over the compensatory gives clear evidence
that the separate display of target-course
input, control-system output, and
error is beneficial in a “loose” control arrangement.
Previous studies have shown
that this is also true in a “tight,” but un- 
aided, tracking system employing position 
control, but that the superiority of the one 
mode over the other disappears with appro- 
priate aiding (2). 

Although the results might be taken to in- 
dicate that the precision of control of an air- 
craft would be enhanced by adopting a pur- 
suit-type rather than a more conventional 
compensatory display, considerable caution 
is called for in this regard. In the first place, 
it is well to recognize the fact that both of 
the present experiments were performed with 
an aircraft simulated only very approxi- 
mately, in a single dimension, and with no 
attempt to produce realistic pilot accelerations.
Secondly, as Birmingham and Taylor
point out (1), pursuit displays are intrinsi- 
cally limited to lower sensitivities than com- 
pensatory indicators, and one is, therefore, 
forced to choose between precision resulting 
from high display gain and the benefit ex- 
pected from changing the nature of the dis- 
play. Finally, engineering feasibility, at least 
at present, favors the choice of the com- 
pensatory mode of indication. 


Summary 


This study is concerned with comparing the 
relative effectiveness of a pursuit and a com- 
pensatory tracking display in a simulated 
one-coordinate aircraft control loop. 

Two experiments were run, differing only in 
the ranges of the course frequencies compris- 
ing the tracking problem. In each experi- 
ment five Ss were given six one-minute trials 
daily for 12 days, using both the pursuit and 
compensatory displays for each of the track- 
ing courses. Each course was composed of 
three sine waves, and was tracked by means 
of a spring-centered joy stick control. The 
following results were obtained. 

1. With the slowest course there was no 
significant difference in error scores between 
compensatory and pursuit tracking. 

2. With the other three courses, which con- 
tained frequencies three, six, and nine times 
that of the slowest course, pursuit was sig- 
nificantly more accurate than compensatory. 


The absolute difference in favor of pursuit 
increased as the course difficulty level in- 
creased. However, the relative difference be- 
tween the two displays remained constant for 
all but the easiest course. 

It is concluded that the superiority of the 
pursuit mode over the compensatory gives 
clear evidence that the separate display of 
target-course input, control-system output, 
and error is essentially as beneficial in a
“loose” control arrangement as in a “tight” 
but unaided tracking system employing po- 
sition control. However, other considera- 
tions concerning the choice of pursuit or com- 
pensatory displays for aircraft control systems 
suggest considerable caution in the applica- 
tion of these results. 


Received March 7, 1955. 

Previous studies of linear interpolation by
Backstrom (1), Miller (2), and Levett (3) 
have indicated that when a person is re- 
quired to estimate tenths of an interval he 
tends to exhibit a personal subjective scale. 
Only rarely does this subjective scale corre- 
spond to objective tenths, even in the most 
consistent performers. Superimposed on each 
subjective scale are variable errors in reading 
or setting, which differ in magnitude from 
person to person. 

What effect will brief training have in re- 
ducing the biases of subjective scales and in 
reducing variability? 


Apparatus and Procedure 


The apparatus was the same as that used by 
Levett (3). A movable marker can be positioned 
between two fixed markers 10 mm. apart by means 
of a screw control. The position of the movable 
marker is magnified by a light lever onto a scale 
visible only to E. 




Pretest. Twenty-seven students of varying back- 
ground acted as Ss. At the first session each S 
made settings at what he estimated to be one-tenth, 
two-tenths . . . to nine-tenths in a randomized or- 
der assigned by E, until he had completed 135 set- 
tings, 15 at each estimated tenth. No knowledge of 
results was given. At the second session, the same 
procedure was repeated. 

Training. Three training sessions of 135 settings 
each were carried out for each S within a two-week 
period. After each individual setting, S was told 
the numerical value in hundredths and then the cor- 
rect position was demonstrated by E. Otherwise, 
the procedure was the same as for the pretest. 

Posttest. After a lapse of time, varying between 
32 and 81 days, each S was recalled for a single ses- 
sion of 135 settings without knowledge of results. 


Results 


Variability was only slightly affected by 
training with knowledge of results. Simple 
practice without knowledge reduced the root- 
mean square of all standard deviations from 
.020 in the first pretest session to .016 in the
second. For the three training sessions the
corresponding figures were .015, .012, and
.013; and for the posttest, .014.

Rudolph G. Schubert and William Leroy Jenkins

Fig. 1. Mean settings (panels: Pretest, Training, Posttest).

Subjective scales, however, were markedly 
affected by training. Figure 1 shows graphi- 
cally the mean settings for one-tenth through 
five-tenths for the second pretest, the third 
training session, and the posttest. (The biases 
of the second pretest correlate .80 with those 
of the first pretest.) 
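
The two quantities reported here, bias of the subjective scale and variability of settings, can be computed as in the following sketch. The data layout, the function name, and the use of the sample standard deviation are assumptions made for illustration; the settings shown are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_settings(settings_by_target):
    """Bias and variability of interpolation settings.

    `settings_by_target` maps each target fraction (e.g. 0.3 for three-tenths)
    to the list of settings made at that target, on a 0-1 scale.
    Returns (bias per target, SD per target, root-mean-square of the SDs).
    """
    biases = {t: mean(v) - t for t, v in settings_by_target.items()}
    sds = {t: stdev(v) for t, v in settings_by_target.items()}
    rms_sd = sqrt(mean([sd ** 2 for sd in sds.values()]))
    return biases, sds, rms_sd

# Hypothetical settings for two targets (not data from the study).
settings = {0.1: [0.12, 0.14], 0.2: [0.19, 0.21]}
biases, sds, rms_sd = summarize_settings(settings)
```

A nonzero bias at a given tenth indicates a displaced subjective scale, while the root-mean-square of the standard deviations summarizes the variable error across all tenths.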

After brief training nearly all Ss show sub- 
jective scales that closely correspond to ob- 
jective tenths. However, the effect is only 
temporary. The posttest shows a return to 
biased subjective scales but with a few of 
the more extreme deviations eliminated. The
correlation between residual bias and elapsed
time (32-81 days) is only .09. Whether the 
biases could be permanently eliminated by 
prolonged training remains an open question. 

A few articles concerning the attitudes of
younger toward older workers have appeared 
in the literature. Tuckman and Lorge (5), 
for instance, found older workers reporting 
being made sport of by younger ones and 
feeling that younger workers wanted them to 
step down so that they might have a chance 
at promotion. Kirchner, Lindbom, and Pater- 
son (2), using a scale of attitudes toward 
older workers, found significantly and increasingly
favorable attitudes toward older
workers with an increase in age of respondent 
groups. Rich (4), through sociometric analy- 
sis, found clique alignments of “Oldtimers” 
vs. “Newcomers.” Following a series of train- 
ing programs the “Newcomers” tended to 
break ranks and choose as friends members 
of the “Oldtimer” group. The latter, how- 
ever, tended to maintain their clique group in 
their own voting. 

Some of the attitudes found in these stud- 
ies may be expressed in important, overt be- 
havior. One way of investigating the matter 
is through the merit ratings assigned younger 
and older workers by raters of various ages. 
This becomes an important behavior espe- 
cially where the ratings are to be used for ad- 
ministrative action such as promotions, wage 
and salary increases, etc. 


Procedure 


Such an investigation was undertaken with data 
from a ranking system in use in a large manufactur- 
ing concern. This system was a nomination pro- 
cedure whereby all higher level personnel ranked 
technical and supervisory personnel below their job 
level in terms of an over-all judgment. The rankings
cut across divisional lines, i.e., judgments were
made on any man whose performance was known 
to the ranker regardless of whether or not he was 
supervised by the ranker. In this way, the system 
provided a great number of individual judgments
made by nominators of various ages upon nominees
of various ages.

1 The study was supported in part by a research
fellowship awarded the writer by the Ohio State
Development Fund for the study of age in relation
to employment.

The nomination rankings were obtained on 402 
supervisors, administrators, and technical personnel 
of the executive payroll of the company. There 
were 103 higher level personnel to judge these men. 
Of the 103 nominators, however, 89 were included 
among the 402 men rated. Only 14 of the nomi- 
nators, in other words, were at such a level as to 
have no one over them to make judgments upon 
them. The 103 nominators ranked, on the average, 
46.5 men (SD = 30.82). When the rankings were 
obtained, they were converted to standard scores 
based on the rank assigned in relation to the num- 
ber of persons ranked. 
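
The conversion formula is not given. One common way of turning a rank into a standard score that takes account of the number of persons ranked is sketched below; the normal-deviate method, the mean of 50, and the standard deviation of 10 are assumptions for illustration, not the procedure actually used by the company.

```python
from statistics import NormalDist

def rank_to_standard_score(rank, n_ranked, mean=50.0, sd=10.0):
    """Convert a rank (1 = highest) among `n_ranked` persons to a standard score.

    The rank is first expressed as the proportion of the group falling below
    the midpoint of that rank, then as the corresponding normal deviate.
    """
    if not 1 <= rank <= n_ranked:
        raise ValueError("rank must lie between 1 and n_ranked")
    proportion_below = (n_ranked - rank + 0.5) / n_ranked
    z = NormalDist().inv_cdf(proportion_below)
    return mean + sd * z

top = rank_to_standard_score(1, 5)
middle = rank_to_standard_score(3, 5)
bottom = rank_to_standard_score(5, 5)
```

A conversion of this kind makes a rank of, say, third among five comparable with a rank of thirtieth among fifty, which matters here because the nominators ranked very different numbers of men.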

In analyzing these data, no attempt was made to 
get either an average for a ranking person or a 
ranked person. Thus we are here concerned with 
the individual judgments made by a nominator of 
a certain age upon a nominee of a certain age. 


Results 


Specifically, four separate age groups of 
nominators were studied, as shown in Table 1. 
For each of these nominator age groups the 
correlation was obtained between the age of 
the nominees and the standard scores assigned 
them.? Under this design, if older nomina- 
tors are favoring older nominees, the behav- 
ior should be reflected in decreasing negative 
correlations or increasing positive correlations 
as one ascends the nominator age columns. 

The last two columns of Table 1 are con- 
cerned with this relationship. The epsilon 
column * is the one with which we should 
be concerned in that, for all four nominator 
groups, relationships are significantly curvi- 
linear at the .01 level. But for both the 
product-moment coefficients and the epsilons 
no consistent direction of relationship is 


* Consider, for example, the nominator age group 
25-34 years. A scattergram was constructed for the 
variables age of nominee (ranging from 25-64 years) 
and standard score. Similar scattergrams were then 
constructed for the remaining nominator age groups. 

3 Epsilon, a measure of relationship in nonlinear 
data, is explained by Peters and Van Voorhis (3, pp. 
319-324). 
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Table 1 


Relationship of Age of Nominator to Scores Assigned 
Nominees of Various Ages 








Nomi- 

nator Number 
Age of 

Group Rankings M SD r 
55-64 1,380 43.7 10.30 — .08 

45-54 1,902 43.2 10.20 —.10 

35-44 3,557 41.6 10.10 


—.14 
25-34 702 41.2. 10.05 —.08 


Age of Nominees 








* Signs are not usually attached to epsilons. They are here 
shown, however, to give the reader a picture of the general 
trend. 


found among the four groups. When the 
epsilons were submitted to a chi-square test * 
the resulting value was found to stand be- 
tween the .50 and .30 levels. Thus all of the 
epsilon values may be inferred to have arisen 
from a common population. The data, there- 
fore, do not demonstrate any reliable rela- 
tionship between the age of the nominator 
and the scores he assigns nominees of various 
ages. Of even greater importance, for our 
purposes, however, is the lack of consistent 


trend of the relationships, thus refuting the 
above mentioned hypotheses. 


4 Which is exact for product-moment coefficients
(1, p. 135) but in this context is only an approxi- 
mate significance test. 
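
As an illustration of the kind of test the footnote describes, the sketch below applies a chi-square test of homogeneity to the four product-moment coefficients of Table 1. It is not a reproduction of the authors' computation: they tested the epsilons, whose values are not available here, and the sketch assumes the usual Fisher z form of the test and treats the number of rankings in each group as if it were an independent sample size. The function name and the critical value constant are illustrative.

```python
import math

def homogeneity_chi_square(rs, ns):
    """Chi-square test that several correlations share one population value.

    Each r is converted to Fisher's z and weighted by n - 3; the weighted
    squared deviations from the pooled z are summed, with k - 1 degrees
    of freedom.
    """
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_pooled = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    chi2 = sum(w * (z - z_pooled) ** 2 for w, z in zip(weights, zs))
    return chi2, len(rs) - 1

# Product-moment r's and numbers of rankings from Table 1
# (nominator age groups 55-64, 45-54, 35-44, 25-34).
rs = [-0.08, -0.10, -0.14, -0.08]
ns = [1380, 1902, 3557, 702]

chi2, df = homogeneity_chi_square(rs, ns)
CRITICAL_05_DF3 = 7.815  # chi square needed at the .05 level with 3 df
print(round(chi2, 2), df, chi2 > CRITICAL_05_DF3)
```

A chi square below the critical value is read as in the text: the coefficients may be treated as arising from a common population.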


Summary and Implications 


By correlation analysis of the nomination 
scores assigned to 402 supervisory employees 
of various ages by higher level nominators of 
various ages, the possibility of age-on-age bias 
in an operating situation has been investi- 
gated. Although other investigators have 
found biases of attitudes of one age group 
toward another, operationally, bias as here 
defined was not found. Thus correlations 
between age of those nominated and nomina- 
tion scores did not differ significantly or sys- 
tematically among four nominator age groups. 
For these data, no “battle of the ages” has 
appeared to add to the usual rating difficul- 
ties. 
Trade School Norms for Some Commonly Used Tests
In counseling and selection programs, local
norms, both geographical and for particular 
schools and institutions, are desirable. The
importance of developing such norms, and of
making them available, is well brought out
in the Psychological Corporation’s Test Service
Bulletin for May, 1950, in an article entitled
“Norms Must Be Relevant” (7).


Procedure 


As part of a larger study of students enrolling in 
a large industrial training school, data on certain 
commonly used tests were obtained for over 1,000 
students. The tests used included the Bennett Me- 
chanical Comprehension Test AA, the Revised Min- 
nesota Paper Form Board MA, and the Army 
General Classification Test AM. The results are 
presented in the form of percentile norms to add to 
the data on these tests. The Kuder Preference Rec- 
ord, Form BM, was also administered to some of 
the students but the results are not reported here. 

The institution. The school from which the sam- 
ple comes is a large, privately endowed, nonprofit 
school in a large city in the Middle West. Students 
are drawn from the entire country, though most are 
from the Upper Midwest, primarily the State of 
Minnesota. 

The school offers 20 different courses, 15 of which 
were included in the present study. The five courses 
not included are nonmechanical in content, or en- 
rolled very few students during the period of the 
study. 

No specific requirements for admission to the school 
are set up, except age. Applicants must be 16 years 
of age or over. This age requirement and state laws 
regarding compulsory school attendance insure that 
practically all students have completed the eighth 
grade. A few students are denied admission because 
of a poor school record, and a few are discouraged 
on the basis of poor mathematics grades or dislike 


1 The authors wish to acknowledge the assistance
of the following in the administration of the tests: 
Arthur D. Bradley, Rome Matthews, Charles New- 
strom, and Paul H. Schwankl. Dr. Patterson is re- 
sponsible for the statistical analysis and the writ- 
ing of the present report. 

2 Although this article has been approved for pub- 
lication by the Veterans Administration, the conclu- 
sions reached are those of the authors and do not 
necessarily reflect the position of the Veterans Ad- 
ministration. 


for mathematics. No specific criteria are applied, 
however, and very few students are prevented from 
enrolling. There is thus very little selection of stu- 
dents. Self-selection does occur on the basis of in- 
terest and motivation, of course. 

The sample. New students are admitted to the 
school each month during the school year. About 
half of the new students for a given year enroll dur- 
ing the first three months of the year, however. The 
present sample consists of students enrolling in the 
courses listed during the 1953-54 school year, and 
the first three months of the 1954-55 school year. 
It was not possible to test every student who entered,
however, for various reasons. About 85% of
the new students enrolling for the period studied
were tested.


Table 1

Ages of 1,011 Students in Trade School Courses

(The frequency distribution by age, beginning at age 17, is not legible in the source.)
Mean 22.37    SD 4.14

Tables 1 and 2 give the distributions and summary 
statistics for age and education of the group tested. 

Conditions of testing. The tests were adminis- 
tered to the students in groups of from 20 to 42. 
The instructions accompanying the tests were used. 
Fifteen minutes was allowed for practice problems 
on the AGCT. This time was arrived at on the 
basis of trials with small groups of veterans tested 
at a VA guidance center. If a group being tested 
completed the practice problems in less than 15
minutes, the test was started. Every student was
given an opportunity to complete the Kuder and
the Bennett, if necessary after the other tests had
been completed.


Table 2

Education of 1,011 Students in Trade School Courses

(The frequency distribution by highest grade completed is not legible in the source.)
Mean 11.54    SD 1.36

Although all students had been admitted to the 
school, motivation was sought by telling the stu- 
dents that the results might be important to them 
at a later date; if their work were poor and their 
test scores high, they might be allowed to show that 
they could improve their work. 


Results 


Table 3 shows the means and standard 
deviations for the AGCT, the Bennett, and 
the RMPFB for the 1,011 students included 
in the sample. Percentile norms for the three 
tests are given in Table 4. 

These results may be compared with the 
norms for various groups given in the manuals 
for the tests. The AGCT score correspond- 
ing to the raw score mean of the present 
sample is 116. It is apparent that this is a 
superior group in general intelligence. The 
SD of 16.2 corresponds to an SD of 13.0
AGCT points, which indicates that the present
group is less variable than the general
population.


Table 3

Means and Standard Deviations for 1,011 Trade
School Students on the AGCT, the
Bennett, and the RMPFB

Test                   Mean       SD
AGCT (raw score)       94.73     16.23
Bennett                43.06      8.20
RMPFB                  44.88      8.74


Table 4

Percentile Norms for 1,011 Trade School Students on
the AGCT, the Bennett, and the RMPFB

AGCT raw scores, highest to lowest: 125, 119, 114, 110, 108, 105, 103, 101, 100, 98, 96, 94, 92, 90, 87, 85, 82.
(The percentile labels and the Bennett and RMPFB score columns are illegible in the source, apart from a single Bennett entry of 61.)

On the Bennett, the mean of the present 
sample is slightly higher than the mean of 
candidates for technical courses as given in 
the manual, and higher than the means of any 
other group except candidates for engineering 
positions, and engineering freshmen. The vari- 
ability of the present group appears to be 
slightly less than that of the groups for whom 
data are given in the manual, except candi- 
dates for engineering positions and engineer- 
ing freshmen. Allowing for the slight differ- 
ences in mean and variability, the norms are 
quite close to those for candidates for tech- 
nical training. The present group is slightly 
younger and has less education than the 
manual norm group, however. 

On the RMPFB, the present group scores 
higher than any of the norm groups given in 
the manual except engineering freshmen at 
the University of Minnesota and at Illinois 
Institute of Technology. Variability is less 
than for the manual norm groups except those
mentioned above, compared to which the 
present sample is more variable. The per- 
centile norms approach those for engineering 
freshmen at Illinois Institute of Technology. 
However, as noted in the manual, the scores 
for the Illinois Institute of Technology sam- 
ple should be raised one or two points, since 
the scoring formula used included a correc- 
tion for guessing. 

An analysis of variance of the course means 
on the three tests was done. The 706 cases 
tested during the 1953-1954 year were used 
in this analysis. In each case F was signifi- 
cant beyond the .01 level, indicating that the 
variations in the means among the 15 courses 
are significantly greater than chance. The 
actual differences among the means are rela- 
tively small, however. The differences be- 
tween the highest and lowest means are 13 
points for the AGCT, 7 points for the Ben- 
nett, and 11.5 points for the RMPFB. It is 
of interest to mention, however, that the rank 
orders of the courses by test means correlated 
significantly with the rankings of the courses 
by difficulty level by two judges. The rank 
correlations were .77 and .80 for the AGCT, 
.59 and .69 for the Bennett, and .72 and .70 
for the RMPFB. The rankings of the judges 
correlated .93. 
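
For readers who want to see how a rank correlation of this kind is obtained, here is a minimal sketch. It assumes the coefficient is Spearman's rho (the text says only "rank correlations"), and the course means and judge's ranks in the usage lines are hypothetical values, not data from the study.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: the product-moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical example: mean test scores for five courses and one judge's
# difficulty ranking of the same courses (5 = most difficult).
course_means = [88.0, 91.5, 94.0, 97.2, 101.0]
judge_ranks = [1, 3, 2, 5, 4]
print(round(spearman(course_means, judge_ranks), 2))
```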

It would appear that the problem of differences
among courses warrants further consideration.
Further analysis is planned, using
all 1,011 cases. The ordinary t test of
differences between pairs of means is not appropriate
following the obtaining of a significant
F in the analysis of variance, and appropriate
procedures are just being worked
out for this type of problem (2).


A note on reliability. About 60% of the 
students in the present sample were veterans. 
Since veterans are eligible for vocational coun- 
seling from the Veterans Administration, it 
was to be expected that some of them had 
taken advantage of this opportunity and in 
the process of counseling had taken one or 
more of the tests included in this study. 
Scores on these tests were obtained from the 
VA records for each veteran tested during the 
1953-1954 school year. These data provided 
material for test-retest correlations. Esti- 
mates of reliability and of whether a practice 
effect was present in the above norms were 
thus possible. 

The data are given in Table 5. The cor- 
relation of .88 for the AGCT is close to the 
equivalent forms correlation of .92 reported 
in the manual, and higher than the test-retest 
reliability of .82 reported in the same source 
(6). On the basis of a test-retest gain of 1.3 
points, the manual suggests that the AGCT 
may be used for retesting without concern for 
practice effect. The gain in the present study 
is greater, and is statistically significant. 

The reliability of the Bennett is surpris- 
ingly low. A search for other reliability co- 
efficients led to the finding that they are 
practically nonexistent. Super (9) reports 
that the only coefficient he could locate was 
the split-half reliability of .84 for 500 ninth- 
grade boys, as reported in the manual. The 
variability of the present sample is small, 
which may be a factor in the low correlation 
obtained. No significant practice effect is 
present. 

The manual for the RMPFB comments on 
the “paucity of published reliability data for 
the MPFB Test” (4, p. 8). Quasha and 


Table 5

Test-Retest Comparisons and Correlations for the AGCT, the Bennett, and the RMPFB

                  Test-Retest Interval
                       in Months           Age     Education
Test         N      Mean*      Range       Mean      Mean       Mean 1     Mean 2
AGCT         59     10.9       0-66        25.1      11.3       92.39      96.81
Bennett      86     11.9       0-88        24.8      11.3       42.74      43.52
RMPFB        68     13.1       0-88        24.6      11.3       44.22      47.18

* Over 75% of the intervals were one year or less.
Likert (5) report a correlation of .80 for
Series A and B, and .85 for AA and BB.
Stephens (8) found a test-retest correlation 
of .85 for the machine-scored series AA. 
Ebert and Simmons (3) give test-retest cor- 
relations for 10 to 14-year-old children, 
tested 1 to 4 years apart. For 87 children 
tested at age 13 and at age 14, including un- 
specified proportions of both sexes, the cor- 
relation was .79. The present coefficient of 
.71 for Form MA is thus somewhat lower 
than those reported for earlier forms. The 
practice effect is significant. 

In comparing these coefficients with other 
test-retest correlations, the interval between 
tests must be considered. In the present case 
this was not constant, but varied from less 
than a month to over seven years. The ma- 
jority were 12 months or less, however. The 
groups for whom test-retest data were avail- 
able are closely similar to the total sample 
in test scores and education, but somewhat 
older, which is accounted for by the fact that 
they are all veterans. Although there appear 
to be significant practice effects on the AGCT 
and the RMPFB, the actual gain is relatively 
small. This, coupled with the fact that less 
than 10% of the total sample had taken the 
tests previously, should not affect the norma- 
tive data significantly. It is likely that ap- 
proximately the same proportion of applicants 
to other schools of this type have previously 
taken such tests prior to entrance. 


Summary 


Normative data on the AGCT, Bennett, 
and RMPFB tests are reported for 1,011 


students enrolling in a large, privately en- 
dowed industrial institute. The norms are 
compared with those for other groups given 
in the manuals for the tests. Some data on 
test-retest reliabilities are also given. These 
data are part of a larger study of the predic- 
tion of success in trade school courses by 
means of tests and background factors. 
The versatility of the chi-square test of
significance has been responsible for a con- 
stantly increasing application of this tech- 
nique in a wide variety of industrial research 
situations. One of the most common uses to 
which it lends itself is the “preference” ex- 
periment, in which a sample of subjects is 
asked to indicate which of two objects is pre- 
ferred. In such a situation, a chance hy- 
pothesis predicts that a 50-50 split in prefer- 
ences will be obtained; chi square provides a 
test of the significance of the difference, if 
any, between the observed frequencies and 
this chance prediction. 

The question often is raised whether an ad- 
vance estimate can be made by the experi- 
menter of the number of subjects (Ss) re- 
quired for a definitive result. This paper sug- 
gests a plan by which some control over the 
number of Ss may be effected during the 
course of data collection as a result of formu- 
lating the problem in terms of chi square. It 
should be noted that the method of sequential 
t tests (of the difference between proportions),
described by Wald (2), while generally ap- 
plicable to this problem, involves considera- 
tion of both alpha and beta hypotheses and 
demands a degree of statistical sophistication 
not required by the considerations in this 
paper. 

In a two-celled table, as is obtained in pref- 
erence experiments, the general form of the 
equation of chi square, 


$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \qquad (1)$$
may be reduced to a more simple form. It 
should be noted that the number of degrees 
of freedom in this case is one less than the 
number of categories, or 1. This is because 


1 The writer wishes to express his appreciation to 
Drs. C. L. Wilson and R. R. MacKie for their sug- 
gestions during the preparation of the original manu- 
script. 


one restriction has been placed upon the 
table: the observed value of N (1). 

Substituting the letter, d, for the expression
$f_o - f_e$ (the difference between observed
and expected frequencies), in the numerator
of Equation 1, we get


$$\chi^2 = \sum \frac{d^2}{f_e} \qquad (2)$$

Since the theoretical frequency is taken to
be 50% of N, we may substitute N/2 for $f_e$
in the denominator of the right-hand term.


$$\chi^2 = \sum \frac{d^2}{N/2} \qquad (3)$$


In addition, we know that the sum of
the cell-square contingencies actually equals
twice either one of these values. This is due
to the fact that the differences between N/2
and each observed frequency will differ only
in sign. We may then write Equation 3 as
follows:

$$\chi^2 = 2\left(\frac{d^2}{N/2}\right) \qquad (4)$$


This becomes 


$$\chi^2 = \frac{4d^2}{N} \qquad (5)$$


It is now possible to substitute the value of
chi square which is significant at the desired
level of confidence and solve Equation 5 for
N or d. (In order to be significant at the .05
level of confidence, chi square with one degree
of freedom must be at least 3.841.)


$$3.841 = \frac{4d^2}{N}, \qquad N = \frac{4d^2}{3.841}$$
Fig. 1. Relationship between number of Ss and required difference between observed
and expected frequencies.


The relationship between the required difference
(d) and size of sample (N) may be
plotted graphically for any desired level of 


confidence. Since the .01 and .05 levels are 
most frequently required by psychological ex- 
perimenters, the graph in Fig. 1 indicates the 
required values of d for various N’s at both 
significance levels. 
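
The values that such a graph displays can also be tabulated directly. The sketch below solves chi square = 4d²/N for d at the two critical values for one degree of freedom (3.841 at the .05 level, 6.635 at the .01 level); the function names are illustrative and are not taken from the paper.

```python
import math

# Critical values of chi square with one degree of freedom.
CRITICAL = {0.05: 3.841, 0.01: 6.635}

def required_difference(n, level=0.05):
    """Smallest |observed - N/2| for which 4 * d**2 / N reaches the critical value."""
    return math.sqrt(CRITICAL[level] * n) / 2

def chi_square_two_cell(observed_a, n):
    """Chi square for a two-celled table tested against a 50-50 split."""
    d = observed_a - n / 2
    return 4 * d ** 2 / n

# Required d for several sample sizes at both significance levels.
for n in (10, 20, 30, 50, 100):
    print(n, round(required_difference(n, 0.05), 2),
          round(required_difference(n, 0.01), 2))
```

For example, with N = 100 a split of 60 to 40 gives d = 10 and chi square = 4.0, which exceeds 3.841 but not 6.635.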

Using this graph, the experimenter may 
plan his research so that data may be col- 
lected in phases, with 10, 20, or 30 Ss at a 
time. After each period of data collection, 
he may determine from the graph whether 
the difference obtained between his observed 
frequency and that expected by chance is 
great enough to exceed the desired level of 
confidence. If it is not, he may collect addi- 
tional data. A limit would have to be deter- 
mined beyond which the experiment would 
not continue if the obtained frequency failed 
to differ significantly from the expected fre- 


quency. This limit would probably be deter- 
mined on the basis of economic or other prac- 
tical considerations. 
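
The phased procedure just described might be sketched as follows. The batch counts, the function name, and the `max_n` stopping limit are hypothetical; the check after each phase is the same comparison of d with the required difference that the graph provides.

```python
import math

def phased_preference_test(batches, critical=3.841, max_n=None):
    """Accumulate (prefer_a, prefer_b) counts phase by phase.

    After each phase the cumulative difference d = |observed - N/2| is
    compared with the d required for significance, sqrt(critical * N) / 2.
    Collection stops at the first significant result or once max_n is reached.
    """
    total_a = total_b = 0
    n, d = 0, 0.0
    for prefer_a, prefer_b in batches:
        total_a += prefer_a
        total_b += prefer_b
        n = total_a + total_b
        d = abs(total_a - n / 2)
        if d >= math.sqrt(critical * n) / 2:
            return {"n": n, "d": d, "significant": True}
        if max_n is not None and n >= max_n:
            break
    return {"n": n, "d": d, "significant": False}

# Hypothetical data: three phases of 10 Ss each.
batches = [(6, 4), (8, 2), (9, 1)]
print(phased_preference_test(batches))
print(phased_preference_test(batches, max_n=20))
```

One caution on this design: checking repeatedly against a fixed critical value raises the overall chance of a spurious significant result above the nominal level, which is the issue the sequential methods cited from Wald are meant to handle.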

It is recognized that this graphical pro- 
cedure involves no new statistical concepts. 
Many similar devices have been employed in 
the past. Its main value lies in the ease with 
which it lends itself to the planning and ad- 
ministration of two-celled data collection pro- 
cedures and the ultimate saving of costs in 
those cases where significant differences are 
achieved before some more or less arbitrary 
number of observations has been made. 
Biserial r is widely used for estimating the
correlation between twofold ratings and test 
scores. Its first cousin, triserial r, seems to 
have been very rarely employed, although 
threefold ratings (high, middle, low) seem 
more natural to many people than twofold 
ratings. The purpose of this article is to 
point out that triserial r ordinarily is easy to
compute and has a lower standard error than 
biserial r. 

In 1946 Jaspen (1) offered the following 
formula for triserial r: 


$$r_{tris} = \frac{M_h y_h + M_m (y_l - y_h) - M_l y_l}{\sigma \left[ \frac{y_h^2}{p_h} + \frac{(y_l - y_h)^2}{p_m} + \frac{y_l^2}{p_l} \right]}$$


in which M stands for mean, p for proportion,
y for normal curve ordinate, and σ for
the standard deviation of all the scores. The
subscripts h, m, and l signify the high, middle,
and low groups.

If the high and low groups are equal, how- 
ever, Jaspen’s forbidding formula reduces to 
the following simple expression: 


$$r_{tris} = \frac{M_h - M_l}{\sigma} \cdot \frac{p}{2y}$$
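
The reduction can be checked in a few lines. This sketch assumes Jaspen's formula in the form written on the first line below, and that equal tail proportions ($p_h = p_l = p$) imply equal ordinates ($y_h = y_l = y$), so that the middle-group term vanishes:

```latex
r_{tris}
  = \frac{M_h y_h + M_m (y_l - y_h) - M_l y_l}
         {\sigma \left[ \frac{y_h^2}{p_h} + \frac{(y_l - y_h)^2}{p_m} + \frac{y_l^2}{p_l} \right]}
  = \frac{y (M_h - M_l)}{\sigma \left[ \frac{y^2}{p} + 0 + \frac{y^2}{p} \right]}
  = \frac{y (M_h - M_l)}{2 \sigma y^2 / p}
  = \frac{M_h - M_l}{\sigma} \cdot \frac{p}{2y}
```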


Because Jaspen gave no mathematical formula
for the standard error of triserial r, an
empirical study was undertaken as follows:
From a population having a known product- 
moment r, 100 samples of 100 pairs of scores 
were selected at random. In each sample, 
one set of scores was reduced to threefold 
ratings using proportions of .05, .06, etc. up 
to .50 for equal high and low groups. Tri- 
serial ry was computed for each sample, giving 
an empirical sampling distribution of 100 tri- 
serial r’s for each proportion. In a similar 
fashion sampling distributions of biserial r’s 
were computed. The process was carried out 
with three populations having product-moment 
r’s of .28, .66, and .91. 
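
A sampling study of this kind can be imitated with a short simulation. The sketch below is not the author's procedure: it draws from a bivariate normal distribution rather than from a finite population of scores, cuts each sample at its own ranks so that the tails are exactly equal, and uses the simple equal-tails formula (mean difference over sigma, times p over 2y). The function names and the seed are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def triserial_r_equal_tails(rated, scores, p):
    """Simple triserial r with proportion p of the cases in each tail.

    `rated` is the variable reduced to low/middle/high; `scores` is the
    continuous variable whose group means are compared.
    """
    n = len(rated)
    k = round(p * n)  # number of cases in each tail
    order = sorted(range(n), key=lambda i: rated[i])
    low = [scores[i] for i in order[:k]]
    high = [scores[i] for i in order[-k:]]
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(1 - p))
    return (mean(high) - mean(low)) / pstdev(scores) * p / (2 * ordinate)

def simulate_se(rho, p, n=100, samples=100, seed=0):
    """Mean and standard deviation of triserial r over repeated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(samples):
        rated, scores = [], []
        for _ in range(n):
            u, v = rng.gauss(0, 1), rng.gauss(0, 1)
            rated.append(u)
            scores.append(rho * u + math.sqrt(1 - rho ** 2) * v)
        estimates.append(triserial_r_equal_tails(rated, scores, p))
    return mean(estimates), pstdev(estimates)

# 100 samples of 100 pairs, population r = .28, 25% of cases in each tail.
print(simulate_se(0.28, 0.25))
```

The second value returned is the empirical standard error, the quantity plotted for each proportion in the study.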


Figure 1 shows the results graphically. 
The open circles indicate empirical standard
deviations of the sampling distributions for
biserial r’s. The dotted curves follow Soper’s








Fig. 1. Empirical standard errors of biserial and triserial r’s.
formula (2) for the standard error of biserial r:


$$SE_{bis} = \frac{\sqrt{pq}/y - r^2}{\sqrt{N}}$$

The solid circles indicate empirical standard 
errors for triserial r’s. The solid curves are 
drawn according to a suggested formula for 
the standard error of triserial r from equal 
tails: 


$$SE_{tris} = \frac{\sqrt{pq}/y - r^2 - .5 + p}{\sqrt{N}}$$


where p = the proportion in one tail and
q = 1 - p.

From Fig. 1 it is evident that the empirical 
standard errors of triserial r are consistently 
lower than the standard errors of biserial r, 
reaching a minimum when p is about .25.
As long as p is .10 or greater, the standard 
error of any triserial r is less than the standard
error of the best biserial r. Further, it
appears that the empirical data for triserial r 
fit the suggested formula better than the em- 
pirical data for biserial r fit Soper’s formula. 

Another empirical study was undertaken to 
determine the effect of using the simple for- 
mula for triserial r when the high and low 
groups are not exactly equal, the mean pro- 
portion in the tails being used for p and to 
determine y. Table 1 shows in parallel col- 
umns the results with the full Jaspen formula 
and the simple formula for three sets of un- 
equal extremes: .30/.20, .35/.15, and .40/.10. 
With extremes of .30/.20, no bias and no in- 
crease in standard error are introduced by 
using the simple formula. It seems safe to
say that the simple formula can be
used when the high and low groups do not
differ by more than .10, but with greater inequality
it is better to employ the full Jaspen
formula.


Table 1

Triserial r’s with Unequal Extreme Proportions

Extreme               Means               Standard Errors
Proportions       Jaspen    Simple       Jaspen    Simple
.30/.20            .265      .262          ...      .108
                   .654      .667          ...      .070
                   .897      .908          ...      .033
.35/.15            .253      .250          ...      .118
                   .646      .674          ...      .074
                   .893      .929          ...      .040
.40/.10            .243      .252         .105      .135
                   .652      .703          ...      .099
                   .895      .968          ...      .065

(... marks entries that are illegible in the source; where only one standard error survives in a row it is shown under Simple.)


A worked example follows:


EXAMPLE

Low:     48 55 62 40 43 57
Middle:  48 57 45 58 53 62 54 67 59 47 67 62
High:    83 71 55 75 62 80 67

$M_l = 50.83$, $p_l = .24$
$M_h = 70.43$, $p_h = .28$
$\sigma = 10.85$

$$p = \text{mean of } p_h \text{ and } p_l = \frac{.28 + .24}{2} = .26$$

$$r_{tris} = \frac{70.43 - 50.83}{10.85} \cdot \frac{.26}{2(.3244)} = .72$$
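
The worked example can be recomputed from the raw scores. The sketch below assumes the simple formula is the mean difference over sigma multiplied by p/(2y), with y the normal ordinate at the point cutting off proportion p in one tail, and that Jaspen's full formula has the three-term numerator and bracketed denominator used in `triserial_jaspen`; both function names are illustrative.

```python
from statistics import NormalDist, mean, pstdev

# Scores from the worked example, grouped by threefold rating.
low = [48, 55, 62, 40, 43, 57]
middle = [48, 57, 45, 58, 53, 62, 54, 67, 59, 47, 67, 62]
high = [83, 71, 55, 75, 62, 80, 67]

def ordinate(tail_proportion):
    """Normal curve ordinate at the point cutting off the given tail proportion."""
    z = NormalDist().inv_cdf(1 - tail_proportion)
    return NormalDist().pdf(z)

def triserial_simple(low, middle, high):
    """Simple formula, using the mean of the two tail proportions for p."""
    scores = low + middle + high
    n = len(scores)
    p = (len(low) / n + len(high) / n) / 2
    return (mean(high) - mean(low)) / pstdev(scores) * p / (2 * ordinate(p))

def triserial_jaspen(low, middle, high):
    """Full formula with separate proportions and ordinates for each tail."""
    scores = low + middle + high
    n = len(scores)
    p_l, p_m, p_h = len(low) / n, len(middle) / n, len(high) / n
    y_l, y_h = ordinate(p_l), ordinate(p_h)
    numerator = y_h * mean(high) + (y_l - y_h) * mean(middle) - y_l * mean(low)
    denominator = pstdev(scores) * (
        y_h ** 2 / p_h + (y_l - y_h) ** 2 / p_m + y_l ** 2 / p_l
    )
    return numerator / denominator

print(round(triserial_simple(low, middle, high), 3))
print(round(triserial_jaspen(low, middle, high), 3))
```

With tail proportions of .28 and .24 the two formulas should give nearly the same value, in line with the finding for extremes that differ by less than .10.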
Received March 21, 1955. 